Computational musicology requires high-quality data, and sampling, gathering, and curating musical data are a big part of what the CCMLab does. Below is a frequently updated list of our own datasets, along with numerous other publicly available musical datasets.

CCML datasets

CoCoPops: Billboard
- Melodic and harmonic transcriptions of 749 files sampled from the Billboard Hot 100
- Encoded in humdrum format.
CoCoPops: Rollingstone
- Melodic and harmonic transcriptions of 200 songs from Rolling Stone's list of Greatest Songs of All Time
- Encoded in humdrum format.
MCFlow
- Transcriptions of rhythm and rhyme in 124 popular rap songs.
- Encoded in humdrum format.
Star Wars Thematic Corpus
- John Williams Themes
- Encoded in humdrum format.
Regional Hip-Hop
- An addendum to MCFlow.
- Encoded in humdrum format.
Sanidha
- Studio-quality video and source-separated audio recordings of 8 hours of Carnatic music, performed by expert musicians.
- Encoded in audio/video format.

Other datasets

The Musical Corpora Register is an excellent list of musical corpora, curated by Daniel Harasim at the École Polytechnique Fédérale de Lausanne in Switzerland. We maintain a similar “MusoRepo” list on GitHub as well. Another great resource is the Digital Resources for Musicology page, hosted by Stanford’s Center for Computer Assisted Research in the Humanities; don’t miss the links to their ADAM and EVE resources.

More generally, you can find many humdrum datasets on GitHub by searching for the hashtags #digital-scores or #humdrum (or both). If you post your own datasets on GitHub, please add appropriate tags, and ask to have your data added to the Musical Corpora Register!
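
The search described above can also be scripted. The following is a minimal sketch, not an official tool: it assumes the hashtags correspond to GitHub repository topics, and that the public GitHub REST search endpoint with its `topic:` qualifier is an acceptable way to query them. The helper name `topic_search_url` is hypothetical, and the code only builds the query URL; it does not send a request.

```python
from urllib.parse import urlencode

# Public GitHub REST endpoint for repository search.
GITHUB_SEARCH_API = "https://api.github.com/search/repositories"


def topic_search_url(topics, per_page=30):
    """Build a repository-search URL for one or more topics.

    Hypothetical helper. Assumes each hashtag maps to a GitHub topic.
    Several space-separated `topic:` qualifiers narrow the search to
    repositories that carry all of the given topics.
    """
    # Accept tags written with or without the leading '#'.
    query = " ".join(f"topic:{t.lstrip('#')}" for t in topics)
    return f"{GITHUB_SEARCH_API}?{urlencode({'q': query, 'per_page': per_page})}"


if __name__ == "__main__":
    # One URL per tag covers the "either" case ...
    for tag in ("#digital-scores", "#humdrum"):
        print(topic_search_url([tag]))
    # ... and a single combined URL covers the "both" case.
    print(topic_search_url(["#digital-scores", "#humdrum"]))
```

Fetching any of these URLs returns JSON whose `items` field lists the matching repositories. Unauthenticated requests to this endpoint are rate limited, so a script that runs often should authenticate.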

Digital Edition of Beethoven Piano Sonatas
- Digital edition of L. van Beethoven's piano sonatas, based on the Durand 1915 edition edited by Paul Dukas.
- Encoded in humdrum format.
The McGill Billboard Project
- Harmonic transcriptions of 749 tracks from the Billboard Hot 100.
- Encoded in custom format.