CHORDONOMICON: A Dataset of 666,000 Songs and their Chord Progressions

Spyridon Kantarelis, Konstantinos Thomas, Vassilis Lyberatos, Edmund Dervakos, Giorgos Stamou

TL;DR

Chordonomicon addresses the lack of large-scale, structured chord progression datasets by introducing a resource with over 666,000 songs, structured chord progressions, and graph-based representations linked to music ontologies. The authors demonstrate baseline tasks, including a GPT-2–style transformer for chord prediction and graph-kernel–based genre/decade classification, reporting meaningful yet improvable results (e.g., 60.13% chord-prediction accuracy and 40.3%/26.6% in decade/genre classification). The dataset supports integration with the Harte syntax and the Functional Harmony Ontology, enabling knowledge-graph enhancements and hybrid ML approaches, while also addressing ethical data harvesting and copyright considerations. Overall, the work provides a rich testbed for MIR, graph ML, and NLP applications, with clear directions for future improvements in modeling, tokenization, and ontology-driven enrichment.

Abstract

Chord progressions encapsulate important information about music, pertaining to its structure and conveyed emotions. They serve as the backbone of musical composition, and in many cases, they are the sole information required for a musician to play along and follow the music. Despite their importance, chord progressions as a data domain remain underexplored. There is a lack of large-scale datasets suitable for deep learning applications, and limited research exploring chord progressions as an input modality. In this work, we present Chordonomicon, a dataset of over 666,000 songs and their chord progressions, annotated with structural parts, genre, and release date, created by scraping various sources of user-generated progressions and associated metadata. We demonstrate the practical utility of the Chordonomicon dataset for classification and generation tasks, and discuss its potential to provide valuable insights to the research community. Chord progressions are unique in that they can be represented in multiple formats (e.g., text, graph) and in the wealth of information chords convey in given contexts, such as their harmonic function. These characteristics make the Chordonomicon an ideal testbed for exploring advanced machine learning techniques, including transformers, graph machine learning, and hybrid systems that combine knowledge representation and machine learning.
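
To make the "multiple formats" point concrete, the following is a minimal sketch of how one progression could be viewed both as a token sequence (text) and as a weighted directed transition graph. The example progression, the whitespace-separated input format, and the function names are assumptions made for illustration; they are not taken from the dataset or from the authors' code, and the paper's actual graph construction may differ.

```python
from collections import Counter

def progression_to_tokens(progression: str) -> list[str]:
    """Text view: split a whitespace-separated chord string into tokens.

    The whitespace-separated format is an assumption for this sketch.
    """
    return progression.split()

def tokens_to_transition_graph(tokens: list[str]) -> dict[tuple[str, str], int]:
    """Graph view: nodes are chords, and each directed edge (a, b) is
    weighted by how many times chord b immediately follows chord a."""
    return dict(Counter(zip(tokens, tokens[1:])))

# Hypothetical progression, not a record from Chordonomicon.
tokens = progression_to_tokens("C G Am F C G F C")
graph = tokens_to_transition_graph(tokens)

for (src, dst), weight in sorted(graph.items()):
    print(f"{src} -> {dst} (weight {weight})")
```

The token list is the kind of input a sequence model would consume, while the edge-weight dictionary is a compact form that could be handed to a graph library or graph kernel.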

Paper Structure

This paper contains 19 sections, 1 equation, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Dataset genres' distribution
  • Figure 2: Dataset parts' distribution
  • Figure 3: Graph representation of music track with ID 1
  • Figure 4: Dataset chord distribution
  • Figure 5: Cosine similarity between genres
  • ...and 2 more figures