MuPT: A Generative Symbolic Music Pretrained Transformer

Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xinrun Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue, Gus Xia, Emmanouil Benetos, Xiang Yue, Chenghua Lin, Xu Tan, Stephen W. Huang, Jie Fu, Ge Zhang

TL;DR

MuPT investigates training long-range symbolic music models on ABC notation. It introduces SMT-ABC, which synchronizes bars across multiple tracks, and the Symbolic Music Scaling (SMS) Law, which guides data- and compute-aware scaling. The models are decoder-only Transformers with an 8192-token context, trained on 33.6B ABC tokens; the SMS Law better predicts their training loss and guides scaling under compute constraints. Empirically, MuPT achieves strong structural repetition, high intra-track similarity, and favorable subjective judgments compared with baselines such as GPT-4 and MMT, and its intermediate checkpoints are openly accessible. Together, the contributions provide a framework for scalable, open symbolic music foundation models and invite community-driven research via SMT-ABC and SMS Law tooling.

Abstract

In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.

Paper Structure

This paper contains 39 sections, 24 equations, 7 figures, and 8 tables.

Figures (7)

  • Figure 1: Example of a multi-track tune of ABC notation.
  • Figure 2: Illustration of synchronized multi-track ABC notation. Music segments from bars sharing the same index across all tracks, along with their right bar lines, are concatenated to guarantee alignment. The combined elements are then enclosed by a pair of the newly introduced symbol "<|>".
  • Figure 3: Chinchilla Law prediction and loss survey in the setting with 2.1B unique tokens.
  • Figure 4: Training loss for different model sizes and training strategies.
  • Figure 5: The loss curve, the Chinchilla prediction, and the equation labeled eq-D" on 2.1B, 8.4B, and 33.6B training data.
  • ...and 2 more figures
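
The synchronization described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each track has already been split into bar strings that keep their right bar line. The function name synchronize_tracks, the list-of-lists input, and the empty-string padding for tracks with fewer bars are all hypothetical choices made for this example.

```python
from itertools import zip_longest

# Delimiter named in the Figure 2 caption.
BAR_DELIM = "<|>"


def synchronize_tracks(tracks):
    """Merge bars that share the same index across all tracks.

    `tracks` is a list of tracks; each track is a list of bar strings
    that already end with their right bar line (assumed input format).
    Returns one string per bar index, enclosed by a pair of "<|>".
    """
    merged_bars = []
    # zip_longest pads shorter tracks with "" (an assumption of this
    # sketch; the source does not say how uneven tracks are handled).
    for bars_at_index in zip_longest(*tracks, fillvalue=""):
        combined = "".join(bars_at_index)
        merged_bars.append(f"{BAR_DELIM}{combined}{BAR_DELIM}")
    return merged_bars


# Hypothetical two-track tune with two bars per track.
tracks = [
    ["C2 E2 G2 c2|", "B,2 D2 G2 B2|"],  # melody
    ["C,4 G,4|", "G,,4 D,4|"],          # bass
]

for line in synchronize_tracks(tracks):
    print(line)
```

Because bars with the same index are emitted together, a model generating this sequence left to right finishes one bar in every track before starting the next, which is the alignment property the caption describes.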