A Statistical Model for Melody Reduction

by Tianxue (Tess) Hu and Claire Arthur


A commonly cited reason for the poor performance of automatic chord estimation (ACE) systems within music information retrieval (MIR) is that non-chord tones (i.e., notes outside the supporting harmony) contribute to error during the labeling process. Despite the prevalence of machine learning in MIR, there are cases where rule-based or statistical approaches provide a simpler alternative while allowing for insights into musicological practices. By contrast, deep-learning approaches are often described as “black boxes” that offer little insight into the algorithm’s predictive processes.

In this project, we present a statistical model for predicting chord tones (CTs). Our model currently focuses on classical music, since composition in this style is highly constrained, which in theory makes the placement of chord tones highly predictable. Indeed, music theorists have labeling systems for every variety of non-chord tone (NCT), primarily classified by the note’s metric position and its intervals of approach and departure. Using meter, duration, and melodic intervals as predictors, we build a theory-driven model for predicting chord tones on the complete TAVERN dataset. While our probabilistic approach is similar to other efforts in automatic harmonic analysis and ACE, our focus is on melodic reduction rather than on predicting harmony. However, we are working towards applications of our model in assisting ACE. Our model currently performs non-chord tone reduction with an average accuracy of about 73.5% across multiple datasets. We believe a chord-tone model could provide a reduction tool with many applications in computational and cognitive musicology as well as music theory pedagogy. In addition, we implement our melody-reduction model as a visualization program to assist computational musicology researchers and music theorists with melody reduction and non-chord tone identification.


Data Collection and Pre-processing

Dataset: Theme and Variation Encoding with Roman Numerals (TAVERN) corpus

  • 27 sets of themes and variations for piano by Mozart and Beethoven in .krn
  • Full scores + Roman Numerals + functional interpretations
  • Arthur (2016) experimented with only themes from the TAVERN dataset


  • Extract melody from the complete TAVERN dataset
  • Extract musical features for each note
  • Label ground truth (CT or NCT) from scale degree, associated Roman Numeral, and the key
    • 45,299 data points in total: 72% CTs and 28% NCTs in the complete TAVERN dataset.
    • By comparison, 19% of the notes in the themes alone are NCTs (Arthur, 2016).

Model Fitting

We proposed a logistic regression model because the dependent variable—CT or NCT—is binary. Since Arthur’s NCT model was only trained on the themes from the TAVERN dataset (Arthur, 2016), we wished to explore a model’s performance trained on the whole dataset.
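For reference, a logistic regression models the probability that a note is a chord tone as a logistic function of a weighted sum of its predictors. The generic form is sketched below; the symbols $x_1, \dots, x_p$ (predictor values, including any interaction terms) and $\beta_0, \dots, \beta_p$ (fitted coefficients) are notation introduced here for illustration, not taken from the paper.

```latex
\[
P(\mathrm{CT} \mid x_1, \dots, x_p)
  = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)\bigr)}
\]
```

A note is then classified as a CT when this probability exceeds a chosen threshold. A threshold of 0.5 is a common default, but the threshold used in this project is not stated here.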

Dependent Variable:

  • CT or NCT (binary)

Independent Variables:

  • Duration
  • On-beat or off-beat
  • Arriving Interval (a step or a leap)
  • Departure Interval (a step or a leap)
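The four predictors listed above could be derived from a melody roughly as follows. This is a minimal sketch, not the authors' extraction pipeline: the `Note` class, the MIDI-pitch encoding, the two-semitone cutoff for a "step", the integer-onset test for "on-beat", and the extra `repeat` and `none` categories are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Note:
    """Hypothetical note record; the real data are encoded in .krn files."""
    midi_pitch: int   # pitch as a MIDI note number (assumed encoding)
    onset: float      # onset within the bar, in quarter-note beats (assumed)
    duration: float   # duration in quarter-note beats (assumed)


def interval_class(semitones: Optional[int]) -> str:
    """Classify a melodic interval; the 2-semitone cutoff is an assumption."""
    if semitones is None:
        return "none"      # first or last note has no neighbor on that side
    size = abs(semitones)
    if size == 0:
        return "repeat"
    return "step" if size <= 2 else "leap"


def extract_features(melody: List[Note]) -> List[dict]:
    """Return one feature row per note: duration, beat, and both intervals."""
    rows = []
    for i, note in enumerate(melody):
        arriving = note.midi_pitch - melody[i - 1].midi_pitch if i > 0 else None
        departure = (
            melody[i + 1].midi_pitch - note.midi_pitch
            if i < len(melody) - 1
            else None
        )
        rows.append(
            {
                "duration": note.duration,
                # Assumption: a note is "on-beat" if it starts on a quarter-note beat.
                "on_beat": float(note.onset).is_integer(),
                "arriving": interval_class(arriving),
                "departure": interval_class(departure),
            }
        )
    return rows
```

For example, in the three-note melody C4, D4, G4 with onsets 0, 1, and 1.5, the D4 would be on-beat, approached by step, and left by leap.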


After splitting the complete TAVERN dataset into 80% training and 20% testing sets, the baseline accuracy is 70.30%. This is the percentage of CTs in the test set, that is, the accuracy of a classifier that labels every single note as a CT.
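The baseline described above is a majority-class classifier. A minimal sketch of how such a baseline could be computed follows; the function name and the toy labels are hypothetical and do not reproduce the TAVERN data.

```python
from collections import Counter
from typing import Sequence


def majority_baseline_accuracy(labels: Sequence[str]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Toy test set (illustrative only, not TAVERN data): 7 CTs and 3 NCTs.
toy_labels = ["CT"] * 7 + ["NCT"] * 3
print(majority_baseline_accuracy(toy_labels))  # 0.7
```

Any model worth keeping has to beat this number, since the baseline is reached without looking at a single feature.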

Model 0: Logistic Regression with main effects

Main Effects: Departure Interval (DI), Arriving Interval (AI)
  • Accuracy on the test set: 75.33%
  • Area Under the ROC curve (AUC): 0.78

The AUC score in this case reflects the model’s ability to distinguish between CTs and NCTs, where AUC = 1 represents perfect classification, while AUC = 0.5 implies no discrimination ability. Since the testing accuracy will presumably increase when interactions between the predictors are considered, we then applied a forward stepwise logistic regression, continuing until the best AIC score was reached, to find the variables and interactions that significantly impact the model’s performance (Model 1).
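The AUC interpretation above can be made concrete: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison. The function name and the choice to treat CT as the positive class (label 1) are assumptions for illustration, and the quadratic loop is only suited to small examples.

```python
from typing import Sequence


def pairwise_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class (here assumed to be CT), 0 otherwise.
    Tied scores count as half a correct ranking.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Perfect separation gives 1.0; identical scores for all notes give 0.5.
print(pairwise_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
print(pairwise_auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))  # 0.5
```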

Model 1: Forward stepwise logistic regression until best AIC

The significant factors of Model 1 are:

Main Effects: Departure Interval (DI), Arriving Interval (AI)
Interactions: AI:Dur, DI:AI, DI:Dur, DI:Beat, DI:AI:Beat
  • Accuracy on the test set: 75.34%
  • AUC: 0.79
  • AIC: 38471.22
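For readers unfamiliar with the criterion, the standard definition of the Akaike information criterion is given below. The symbols $k$ (number of estimated parameters) and $\hat{L}$ (maximized likelihood of the fitted model) are notation introduced here.

```latex
\[
\mathrm{AIC} = 2k - 2\ln\hat{L}
\]
```

Lower values are preferred: the $2k$ term penalizes every added parameter, so a new main effect or interaction is kept only if it raises the log-likelihood enough to offset the penalty. In a forward stepwise search, terms are typically added one at a time until no remaining candidate lowers the AIC further.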

The testing accuracy of Model 1 failed to improve as expected. A likely reason is the substantial noisiness of the variations. Our next goal was therefore to achieve an equal or higher testing accuracy using a simpler model. Accordingly, we attempted to replicate Arthur’s (2016) approach of training on only the themes (but testing on the full dataset). The logic here was that “rules” are more likely to be followed in simple themes and are difficult to learn from the full variations.

Model 2: Forward stepwise logistic regression until best AIC with themes

Main Effects: Departure Interval (DI), Arriving Interval (AI)
Interactions: AI:Beat, DI:Beat
  • Accuracy (on the same test set as Model 0 and Model 1): 75.39%
  • AUC: 0.78
  • AIC: 1202.98

Model 2 successfully replicates Arthur’s findings (Arthur, 2016). With themes as the training data, though the accuracy and AUC did not significantly improve, the model is much simpler with a substantially lower AIC. In other words, we achieved the same accuracy with a simpler model. As a result, we take Model 2 as the best model.



To test the model’s ability to classify CTs and NCTs in new data, we applied cross-validation to evaluate Model 2’s performance on random subsets of the themes portion of the dataset itself. The themes dataset has 2,039 data points, with 81% CTs and 19% NCTs. Using an 80%/20% training/testing split, we performed 10-fold cross-validation on the themes data. The results of this test of Model 2 are shown below.

Percentage of chord tones in the training set: 83.19%

Cross-validation results:

  • Accuracy: 84.87%
  • Precision: 0.8756
  • Recall: 0.9508
  • F1: 0.9116
  • AUC: 0.8747
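As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the two reported values; the function name is hypothetical, and because the inputs are themselves rounded to four decimals, the result is only expected to agree with the reported F1 to about three decimal places.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for Model 2 under cross-validation on the themes.
reported_precision = 0.8756
reported_recall = 0.9508
reported_f1 = 0.9116

recomputed = f1_from_precision_recall(reported_precision, reported_recall)
# The recomputed value should match the reported F1 up to rounding.
print(abs(recomputed - reported_f1) < 1e-3)  # True
```

The high recall relative to precision is what one would expect when most notes are CTs: the model rarely misses a chord tone but sometimes labels an NCT as one.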

The results indicate that the model neither overfits nor underfits, and that it makes reasonable predictions on the themes dataset itself. Of course, it is still biased by the high proportion of CTs.

Test on other datasets

Previously, we tested the model on only part of the TAVERN dataset (i.e., a fixed 20% test set). Since melodic complexity can vary significantly across variations, we used cross-validation to re-test the model on the complete TAVERN dataset by randomly selecting 2,400 data points in each of 50 separate trials and averaging the accuracy over all trials. In addition to TAVERN, we tested Model 2 on a dataset of Joseph Haydn’s “Sun Quartets” with harmonic annotations in kern format (López, 2017), which contains 6 string quartets (12,616 data points). The Model 2 results on the two datasets (full TAVERN and Sun Quartets) are displayed below. The prediction accuracy is higher than the baseline in both cases, which indicates that Model 2 can distinguish CTs from NCTs in the melodies of both datasets.

Model 2 accuracy: 76.4% (full TAVERN), 70.6% (Sun Quartets)
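The repeated-sampling evaluation described above (2,400 randomly selected data points, 50 trials, averaged accuracy) could be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name, the `(features, label)` data layout, the `predict` callable, and the fixed random seed are all hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple

Example = Tuple[dict, str]  # (feature row, true label "CT" or "NCT"); assumed layout


def subsampled_accuracy(
    data: List[Example],
    predict: Callable[[dict], str],
    sample_size: int = 2400,
    trials: int = 50,
    seed: int = 0,
) -> float:
    """Average accuracy over repeated random subsamples drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    accuracies = []
    for _ in range(trials):
        sample = rng.sample(data, sample_size)
        correct = sum(1 for features, label in sample if predict(features) == label)
        accuracies.append(correct / sample_size)
    return mean(accuracies)
```

Here `predict` stands in for the fitted Model 2. Averaging the two reported accuracies, (76.4 + 70.6) / 2 = 73.5, matches the figure of about 73.5% quoted in the summary.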

Visualization in Verovio Humdrum Viewer

We also present an application of the model in a score visualization program, using color to allow easy identification of the notes predicted as NCTs in the score. We use the Verovio Humdrum Viewer (VHV) as our visualization platform, which is an online symbolic music editor and interactive notation rendering interface for .krn and .mei files (Sapp, 2017). The visual implementation of the model in VHV provides an intuitive way to recognize NCTs in a melody, and it can also serve as another way to evaluate our model’s performance. The visualization could also assist computational musicologists and music theorists with music analysis.

Figure 1: Visualization of a phrase in the TAVERN dataset (with harmony labeled), where black notes are chord tones and pink notes are non-chord tones.



This paper proposed a logistic regression model to predict non-chord tones (NCTs) from a single melodic line from a classical score. The best model, trained on the themes of the TAVERN dataset, used several melodic features as NCT predictors (i.e., duration, beat, arriving and departure intervals). The model can successfully distinguish CTs and NCTs across multiple datasets with an average accuracy of about 73.5%. Finally, we introduced a visualization tool using the model, which performs melodic reduction on a monophonic symbolic score for computational musicology researchers and music theorists.

Paper link: https://drive.google.com/file/d/1r4Z1IT7zQqdNg8u7JYNe-WzKEMCHPMG6/view

Presentation at Future Directions of Music Cognition: https://osf.io/326ch/


Arthur, C. (2016). A Corpus Approach to the Classification of Non-chord Tones Across Genres. In 14th International Conference on Music Perception and Cognition, San Francisco, CA, United States.

Charyton, C. (Ed.). (2015). Creativity and innovation among science and art: A discussion of the two cultures. Springer, London.

Chen, T. P., & Su, L. (2018). Functional Harmony Recognition of Symbolic Music Data with Multi-task Recurrent Neural Networks. In International Society for Music Information Retrieval Conference (ISMIR), Paris, France (pp. 90–97).

Condit-Schultz, N., Ju, Y., & Fujinaga, I. (2018). A Flexible Approach to Automated Harmonic Analysis: Multiple Annotations of Chorales by Bach and Prætorius. In International Society for Music Information Retrieval Conference (ISMIR), Paris, France (pp. 66-73).

Devaney, J., Arthur, C., Condit-Schultz, N., & Nisula, K. (2015, October). Theme and variation encodings with roman numerals (TAVERN): A new data set for symbolic music analysis. In International Society for Music Information Retrieval Conference (ISMIR), Malaga, Spain.

Giraud, M., Déguernel, K., & Cambouropoulos, E. (2013, October). Fragmentations with pitch, rhythm and parallelism constraints for variation matching. In International Symposium on Computer Music Multidisciplinary Research (pp. 298-312). Springer, Cham.

Huron, D. (1995). The Humdrum Toolkit: Reference manual. Menlo Park, California: Center for Computer Assisted Research in the Humanities.

Klapuri, A. P. (2004). Automatic music transcription as we know it today. Journal of New Music Research, 33(3), 269-282.

Koops, H. V., de Haas, W. B., Bransen, J., & Volk, A. (2020). Automatic chord label personalization through deep learning of shared harmonic interval profiles. Neural Computing and Applications, 32(4), 929-939.

Kostka, S., & Payne, D. (2013). Tonal harmony. McGraw-Hill Higher Education.

Masada, K., & Bunescu, R. C. (2017). Chord Recognition in Symbolic Music Using Semi-Markov Conditional Random Fields. In International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China.

Salzer, F. (1949). The Technique of Variation; A Study of the Instrumental Variation from Antonio de Cabezón to Max Reger. University of California.

Sapp, C. S. (2017, March 4). Verovio Humdrum Viewer Documentation. https://doc.verovio.humdrum.org/

López, N. N. (2017). Automatic harmonic analysis of classical string quartets from symbolic score (Master’s thesis, Universitat Pompeu Fabra). https://doi.org/10.5281/zenodo.1095617

Radicioni, D. P., & Esposito, R. (2010). BREVE: an HMPerceptron-based chord recognition system. In Advances in Music Information Retrieval (pp. 143-164). Springer, Berlin, Heidelberg.

Temperley, D. (1997). An algorithm for harmonic analysis. Music Perception, 15(1), 31-68.

Temperley, D., & Sleator, D. (1999). Modeling meter and harmony: A preference-rule approach. Computer Music Journal, 23(1), 10-27.