Computational musicology requires high-quality data, and sampling, gathering, and curating musical data is a big part of what the CCMLab does. Below is a frequently updated list of our own datasets, along with numerous other musical datasets that are publicly available.

CCML datasets

  • CoCoPops: Billboard
    • Melodic and harmonic transcriptions of 749 files sampled from the Billboard Hot 100.
    • Encoded in humdrum format.
  • CoCoPops: Rollingstone
    • Melodic and harmonic transcriptions of 200 songs from Rolling Stone's list of Greatest Songs of All Time
    • Encoded in humdrum format.
  • MCFlow
    • Transcriptions of rhyme and rhythm in 124 popular rap songs.
    • Encoded in humdrum format.
  • Star Wars Thematic Corpus
    • John Williams Themes
    • Encoded in humdrum format.
  • Regional Hip-Hop
    • An addendum to MCFlow.
    • Encoded in humdrum format.

Other datasets

The Musical Corpora Register is an excellent list of musical corpora, curated by Daniel Harasim at the École Polytechnique Fédérale de Lausanne, in Switzerland. We maintain a similar “MusoRepo” list on GitHub as well. Another great resource is the Digital Resources for Musicology page, hosted by Stanford’s Center for Computer Assisted Research in the Humanities. Don’t miss the links to their ADAM and EVE resources.

More generally, you can find many humdrum datasets on GitHub by searching the hashtags #digital-scores or #humdrum (or both). If you post your own datasets on GitHub, please add the appropriate tags, and ask to have your data added to the Musical Corpora Register!
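
If you would rather script the search than click through it, the sketch below shows one possible approach. It assumes the hashtags above correspond to GitHub repository "topics" and that GitHub's public REST search endpoint is used; the function names (topic_query, search_url, fetch_repo_names) are our own illustrative choices, not part of any CCMLab tool.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# GitHub's public repository search endpoint (unauthenticated use is rate limited).
GITHUB_SEARCH_API = "https://api.github.com/search/repositories"


def topic_query(topics):
    """Build a search string that requires every listed topic.

    Space-separated `topic:` qualifiers must all match, so passing two
    tags finds repositories carrying both of them.
    """
    return " ".join(f"topic:{t.lstrip('#')}" for t in topics)


def search_url(topics, per_page=20):
    """Return the full search URL for the given topics."""
    params = {"q": topic_query(topics), "per_page": per_page}
    return f"{GITHUB_SEARCH_API}?{urlencode(params)}"


def fetch_repo_names(topics, per_page=20):
    """Query GitHub (needs network access) and return matching 'owner/name' strings."""
    with urlopen(search_url(topics, per_page)) as response:
        payload = json.load(response)
    return [item["full_name"] for item in payload["items"]]


if __name__ == "__main__":
    # One URL per tag covers the "either" case; both tags together covers "both".
    print(search_url(["#humdrum"]))
    print(search_url(["#digital-scores"]))
    print(search_url(["#digital-scores", "#humdrum"]))
```

To search for either tag, run one query per tag and merge the results; passing both tags in a single call returns only repositories that carry both.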