Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings

Seungyeon Rhyu, Kichang Yang, Sungjun Cho, Jaehyeon Kim, Kyogu Lee, Moontae Lee

TL;DR

This work addresses the difficulty of encoding musical structure in symbolic music generation without domain annotations. It builds a GPT-2-based framework that injects MuseNet-inspired structural embeddings (Part, Type, Time, and Pitch-Class) into MIDI tokens, and explores two initialization strategies for each embedding. Using objective metrics (Structureness Indicator, CPVR, CPI) and subjective A/B tests, the authors identify trade-offs: sinusoidal initialization yields stronger repetition and more common chords, while random initialization improves perceived naturalness and prompt fidelity but can be unstable. The study provides practical, reproducible guidelines and open-source tooling for researchers and developers who want to deploy annotation-free symbolic music generation with large language models.

Abstract

Music generation introduces challenging complexities to large language models. Symbolic structures of music often include vertical harmonization as well as horizontal counterpoint, urging various adaptations and enhancements for large-scale Transformers. However, existing works share three major drawbacks: 1) their tokenization requires domain-specific annotations, such as bars and beats, that are typically missing in raw MIDI data; 2) the pure impact of enhancing token embedding methods is hardly examined without domain-specific annotations; and 3) existing works to overcome the aforementioned drawbacks, such as MuseNet, lack reproducibility. To tackle such limitations, we develop a MIDI-based music generation framework inspired by MuseNet, empirically studying two structural embeddings that do not rely on domain-specific annotations. We provide various metrics and insights that can guide suitable encoding to deploy. We also verify that multiple embedding configurations can selectively boost certain musical aspects. By providing open-source implementations via HuggingFace, our findings shed light on leveraging large language models toward practical and reproducible music generation.

Paper Structure

This paper contains 17 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Example of beat detection results using the well-known Madmom algorithm (Böck et al., 2016) on (a) POP909 and (b) GiantMIDI. The blue and red dotted lines indicate ground truth and predicted beats, respectively.
  • Figure 2: Illustration of the reconstructed structural embeddings. Top: An example of a piano roll where 5 notes appear in different pitches and timings. Bottom: The corresponding tokens and structural embeddings.
  • Figure 3: Illustration of our music generation framework. Left: During training, four types of structural embeddings are concatenated to the input tokens, then projected to the input embedding size for next-token prediction. Right: During inference, structural information inferred via rule-based modules is used for autoregressive generation.
  • Figure 4: Correlation results among SI, CPVR, and CPI computed from the three metric scores of all models. Pearson's correlation coefficients are also reported on the top of each plot.
  • Figure 5: PNSR vs. SI, CPVR, and CPI plots with prompt lengths 64 (solid) and 16 (dashed) from the three methods.
  • ...and 1 more figures
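
The training-time input path described in the caption of Figure 3 (structural embeddings concatenated to the token embeddings, then projected to the input embedding size) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the vocabulary sizes, embedding widths, the function names, and the specific sinusoidal formula (borrowed from standard Transformer positional encodings) are all assumptions chosen for illustration, and the paper's actual configuration may differ.

```python
import numpy as np


def sinusoidal_table(num_positions, dim):
    """Fixed sinusoidal table (assumed form: standard Transformer encoding)."""
    assert dim % 2 == 0, "this sketch assumes an even embedding width"
    positions = np.arange(num_positions)[:, None]
    freqs = np.exp(-np.log(10000.0) * (np.arange(0, dim, 2) / dim))
    table = np.zeros((num_positions, dim))
    table[:, 0::2] = np.sin(positions * freqs)  # even columns
    table[:, 1::2] = np.cos(positions * freqs)  # odd columns
    return table


def random_table(num_positions, dim, rng):
    """Randomly initialized table (small Gaussian, a common default)."""
    return rng.normal(0.0, 0.02, size=(num_positions, dim))


def embed(token_ids, struct_ids, token_table, struct_tables, projection):
    """Concatenate token and structural embeddings, then project."""
    parts = [token_table[token_ids]]
    for name, table in struct_tables.items():
        parts.append(table[struct_ids[name]])
    concatenated = np.concatenate(parts, axis=-1)
    return concatenated @ projection


# Hypothetical sizes, chosen only so the example runs.
rng = np.random.default_rng(0)
VOCAB, D_TOKEN, D_STRUCT = 32, 16, 8
struct_sizes = {"part": 4, "type": 4, "time": 64, "pitch_class": 13}

token_table = random_table(VOCAB, D_TOKEN, rng)
struct_tables = {
    name: sinusoidal_table(size, D_STRUCT) for name, size in struct_sizes.items()
}
projection = random_table(D_TOKEN + len(struct_sizes) * D_STRUCT, D_TOKEN, rng)

# Three tokens with made-up structural indices.
token_ids = np.array([1, 5, 9])
struct_ids = {
    "part": np.array([0, 1, 1]),
    "type": np.array([0, 1, 2]),
    "time": np.array([0, 0, 3]),
    "pitch_class": np.array([0, 4, 7]),
}

inputs = embed(token_ids, struct_ids, token_table, struct_tables, projection)
print(inputs.shape)  # (3, 16)
```

Swapping `sinusoidal_table` for `random_table` when building `struct_tables` switches between the two initialization strategies that the TL;DR contrasts; in a trained model the tables and the projection would be learned parameters rather than fixed arrays.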