NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms

Yashan Wang, Shangda Wu, Jianhuai Hu, Xingjian Du, Yueqi Peng, Yongxin Huang, Shuai Fan, Xiaobing Li, Feng Yu, Maosong Sun

TL;DR

This work tackles high-quality symbolic sheet-music generation by adapting Large Language Model training paradigms to NotaGen, a model pre-trained on 1.6M pieces of sheet music in ABC notation and fine-tuned on about 9,000 classical pieces with period-composer-instrumentation prompts. It introduces CLaMP-DPO, a reinforcement learning approach that uses the CLaMP 2 evaluator within a Direct Preference Optimization framework to improve musicality without human labeling. Empirical results, including subjective A/B tests, show that NotaGen outperforms baselines and rivals human compositions in perceived musicality, while CLaMP-DPO consistently enhances controllability and quality across model architectures and encoding schemes. This demonstrates the viability of translating LLM training paradigms to symbolic music and suggests avenues for extending them to other genres and representations while addressing data scarcity and orchestration complexity.

Abstract

We introduce NotaGen, a symbolic music generation model aiming to explore the potential of producing high-quality classical sheet music. Inspired by the success of Large Language Models (LLMs), NotaGen adopts pre-training, fine-tuning, and reinforcement learning paradigms (henceforth referred to as the LLM training paradigms). It is pre-trained on 1.6M pieces of music in ABC notation, and then fine-tuned on approximately 9K high-quality classical compositions conditioned on "period-composer-instrumentation" prompts. For reinforcement learning, we propose the CLaMP-DPO method, which further enhances generation quality and controllability without requiring human annotations or predefined rewards. Our experiments demonstrate the efficacy of CLaMP-DPO in symbolic music generation models with different architectures and encoding schemes. Furthermore, subjective A/B tests show that NotaGen outperforms baseline models against human compositions, greatly advancing musical aesthetics in symbolic music generation.
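The reinforcement learning step described above can be illustrated with a minimal sketch. It assumes the standard Direct Preference Optimization objective and a simple rank-and-split rule for turning evaluator scores into preference pairs; the paper's exact loss variant, selection rule, and hyperparameters are not given in this section, so the function names, `top_frac`, and `beta` below are hypothetical. The scores stand in for whatever an evaluator such as CLaMP 2 would return, and no CLaMP 2 API is called here.

```python
import math


def build_preference_pairs(samples, scores, top_frac=0.1):
    """Pair the highest-scored samples (chosen) with the lowest-scored (rejected).

    Assumption: `scores` are evaluator outputs where higher means better.
    The split rule (top and bottom `top_frac`) is illustrative only.
    """
    ranked = sorted(zip(scores, samples), key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    chosen = [s for _, s in ranked[:k]]
    rejected = [s for _, s in ranked[-k:]]
    return list(zip(chosen, rejected))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model; `beta` is a hypothetical temperature value.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Usage with made-up scores: "b" scores highest, "a" lowest.
pairs = build_preference_pairs(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.3],
                               top_frac=0.25)
```

Because the pairs come from an automatic evaluator rather than human raters, this kind of loop needs no manual annotation, which matches the claim in the abstract; everything beyond that claim in the sketch is an assumption.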

Paper Structure

This paper contains 25 sections, 4 equations, 11 figures, 4 tables, 1 algorithm.

Figures (11)

  • Figure 1: An overview of NotaGen's training paradigms.
  • Figure 3: Subjective A/B tests on musicality of generated outputs before and after CLaMP-DPO optimization. All models exhibited improvement in human-perceived musicality after applying the CLaMP-DPO algorithm.
  • Figure 4: Subjective A/B test between model outputs and ground truth. NotaGen achieved the highest voting rate against the ground truth among the three models.
  • Figure 5: Illustration of the MIDI Event Transformer architecture, showcasing its two hierarchical decoders: the event-level decoder, which models temporal dependencies across high-level events, and the token-level decoder, which generates the detailed token sequence in an auto-regressive manner.
  • ...and 6 more figures