Large Language Models: From Notes to Musical Form

Lilac Atassi

TL;DR

This work tackles the difficulty of embedding long-scale musical form in ML-generated music by arguing that standard likelihood-based models struggle with high variation across pieces at large time spans. It proposes a novel approach that integrates large language models with MusicGen and EnCodec to design form-driven prompts and transitions, enabling 2.5-minute pieces with perceptual quality on par with training data. Key contributions include an analysis of unlearnable long-form structure, a latent-space autoregressive generation framework, text-conditioned generation with classifier-free guidance, and a two-phase meta-prompt optimization loop to automate form design. The results demonstrate meaningful improvements in perceived musical coherence and structure, suggesting practical potential for controlled, long-form AI-assisted composition.

Abstract

While many topics of the learning-based approach to automated music generation are under active research, musical form is under-researched. In particular, recent methods based on deep learning models generate music that, at the largest time scale, lacks any structure. In practice, music longer than one minute generated by such models is either unpleasantly repetitive or directionless. Adapting a recent music generation model, this paper proposes a novel method to generate music with form. The experimental results show that the proposed method can generate 2.5-minute-long music that is considered as pleasant as the music used to train the model. The paper first reviews a recent music generation method based on language models (transformer architecture). We discuss why learning musical form by such models is infeasible. Then we discuss our proposed method and the experiments.

Paper Structure (6 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Incoherence in images generated by DALL-E. The top-row images show that the generative model has learned that a mirror reflects an image, yet the generated images are evidently incoherent. The bottom-left image shows that the model can generate flags that are coherent at all scales. The bottom-right image, however, reveals the model's struggle with large-scale coherence when generating wavering flags; one of the discontinuities is highlighted.
  • Figure 2: The PO-LLM proposes new instructions and few-shot samples for the MP-LLM. The MP-LLM follows the instructions and generates a set of prompts that MusicGen uses to generate a coherent piece with musical form. The generated music is then rated by human evaluators, and the average MOS is estimated. The meta prompt instructs the PO-LLM to consider the 5 previous prompts with the highest MOS when proposing a new prompt for the MP-LLM.
  • Figure 3: The average MOS of the prompts generated in the exploration phase of the optimization method.
  • Figure 4: The average MOS of the top 5 prompts in each iteration of the optimization.
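
The loop described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the paper's implementation. Every function name here (`po_llm_propose`, `mp_llm_generate`, `musicgen_render`, `collect_ratings`) is a hypothetical stand-in, and so are the section names. The 1-to-5 rating scale and the random scores that replace human evaluators are also assumptions. The only detail taken from the caption is that the 5 prompts with the highest average MOS are fed back to the PO-LLM.

```python
import random
from statistics import mean

TOP_K = 5  # from the caption: the PO-LLM sees the 5 highest-MOS prompts


def po_llm_propose(top_prompts, step):
    """Hypothetical stand-in for the PO-LLM: propose a new instruction for the MP-LLM."""
    return f"instruction #{step} (conditioned on {len(top_prompts)} top-rated prompts)"


def mp_llm_generate(instruction):
    """Hypothetical stand-in for the MP-LLM: one text prompt per section (names assumed)."""
    return [f"{section}: {instruction}" for section in ("opening", "middle", "closing")]


def musicgen_render(section_prompts):
    """Placeholder for MusicGen: returns the prompts unchanged instead of audio."""
    return section_prompts


def collect_ratings(audio, rng, n_raters=5):
    """Placeholder for human evaluators: random scores on an assumed 1-5 scale."""
    return [rng.randint(1, 5) for _ in range(n_raters)]


def optimize(iterations, seed=0):
    """Run the propose / generate / rate loop and return the top-rated instructions."""
    rng = random.Random(seed)
    history = []  # (instruction, average MOS) pairs seen so far
    for step in range(iterations):
        # Feed back only the best TOP_K instructions, as the caption describes.
        top = sorted(history, key=lambda item: item[1], reverse=True)[:TOP_K]
        instruction = po_llm_propose(top, step)
        audio = musicgen_render(mp_llm_generate(instruction))
        history.append((instruction, mean(collect_ratings(audio, rng))))
    return sorted(history, key=lambda item: item[1], reverse=True)[:TOP_K]


if __name__ == "__main__":
    for instruction, mos in optimize(iterations=12):
        print(f"{mos:.2f}  {instruction}")
```

In a real system the three placeholders would call the two language models and MusicGen, and the ratings would come from listeners. The sketch covers only the top-5 selection step and does not model the separate exploration phase shown in Figure 3.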