Large Language Models: From Notes to Musical Form
Lilac Atassi
TL;DR
This work addresses the difficulty of giving ML-generated music large-scale musical form. It argues that standard likelihood-based models struggle because structure at large time spans varies widely across pieces. It proposes an approach that integrates large language models with MusicGen and EnCodec to design form-driven prompts and transitions, enabling 2.5-minute pieces with perceptual quality on par with the training data. Key contributions include an analysis of why long-form structure is unlearnable for such models, a latent-space autoregressive generation framework, text-conditioned generation with classifier-free guidance, and a two-phase meta-prompt optimization loop that automates form design. The results indicate improvements in perceived musical coherence and structure, suggesting practical potential for controlled, long-form AI-assisted composition.
Abstract
While many aspects of learning-based automated music generation are under active research, musical form remains under-researched. In particular, recent methods based on deep learning models generate music that lacks any structure at the largest time scale. In practice, music longer than one minute generated by such models is either unpleasantly repetitive or directionless. Adapting a recent music generation model, this paper proposes a novel method to generate music with form. The experimental results show that the proposed method can generate 2.5-minute-long music that is considered as pleasant as the music used to train the model. We first review a recent music generation method based on language models (transformer architecture) and discuss why it is infeasible for such models to learn musical form. We then present our proposed method and the experiments.
