Content-based Controls For Music Large Language Modeling

Liwei Lin; Gus Xia; Junyan Jiang; Yixiao Zhang

TL;DR

The paper tackles the limited expressiveness of text-only controls in music generation by introducing Coco-Mulla, a content-based control framework built on a parameter-efficient adaptor for a pre-trained music language model. It develops a joint symbolic-acoustic embedding and a time-aware condition adaptor that injects content-based cues (chords, MIDI, and drum textures) into the last layers of a frozen base model, enabling multi-modal control with minimal trainable parameters. The approach achieves effective chord and rhythm control, supports variation and arrangement when combined with text prompts, and demonstrates strong performance in low-resource fine-tuning using pseudo-labeled data from a small corpus. This method broadens practical music editing and composition workflows by enabling direct control over harmonic and rhythmic content without retraining large models, while highlighting areas for handling conflicts between content-based cues and textual guidance.
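The idea of injecting condition cues into a frozen decoder through a small set of trainable parameters can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function `gated_prefix_attention`, the single-head layout, the separate softmax over the prefix, and the scalar `gate` are all assumptions chosen for clarity, and the sketch ignores the time alignment between condition frames and audio frames.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_prefix_attention(x, prefix, w_q, w_k, w_v, gate):
    """Single-head causal self-attention plus a gated condition prefix.

    x:      (T, d) hidden states entering a frozen decoder layer
    prefix: (P, d) condition embeddings (hypothetical chord/MIDI/drum cues)
    w_q, w_k, w_v: (d, d) projection matrices, treated as frozen
    gate:   scalar gate factor, the only layer-level trainable value here
    """
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # The prefix reuses the frozen key/value projections (an assumption).
    pk, pv = prefix @ w_k, prefix @ w_v

    scale = 1.0 / np.sqrt(d)
    # Causal mask: position i may only attend to positions <= i.
    future = np.arange(T)[None, :] > np.arange(T)[:, None]
    causal = np.where(future, -np.inf, 0.0)

    base = softmax(q @ k.T * scale + causal) @ v  # frozen model's own path
    cond = softmax(q @ pk.T * scale) @ pv         # attention over the prefix
    # With gate == 0 the layer reproduces the frozen model exactly.
    return base + gate * cond
```

Starting the gate at zero means fine-tuning begins from the unmodified base model, and the condition path only contributes as the gate moves away from zero.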

Abstract

Recent years have witnessed rapid growth of large-scale language models in the domain of music audio. Such models enable end-to-end generation of higher-quality music, and some allow conditioned generation using text descriptions. However, the control power of text on music is intrinsically limited, as text can only describe music indirectly through metadata (such as singers and instruments) or high-level representations (such as genre and emotion). We aim to further equip the models with direct, content-based controls on innate music languages such as pitch, chords, and drum tracks. To this end, we contribute Coco-Mulla, a content-based control method for music large language modeling. It uses a parameter-efficient fine-tuning (PEFT) method tailored for Transformer-based audio models. Experiments show that our approach achieves high-quality music generation with low-resource semi-supervised learning, with trainable parameters amounting to less than 4% of the original model's size and training on a small dataset of fewer than 300 songs. Moreover, our approach enables effective content-based controls, and we illustrate the control power via chords and rhythms, two of the most salient features of music audio. Furthermore, we show that by combining content-based controls and text descriptions, our system achieves flexible music variation generation and arrangement. Our source code and demos are available online.

Paper Structure (21 sections, 13 equations, 5 figures, 3 tables)

Introduction
Related work
Music Audio Generation
Parameter-Efficient Fine-Tuning
Base Model
Methodology
Joint Symbolic and Acoustic Embedding
Symbolic Chord and MIDI Representation
Acoustic Representation
Masking Scheme and Positional Encoding
Condition Adaptor
Experiment
Datasets
Training Configuration
Evaluation
...and 6 more sections

Figures (5)

Figure 1: The joint embedding module. We randomly mask acoustic or piano roll embedding with probability $r$ during training.
Figure 2: Condition adaptor. The condition prefix is injected into the self-attention mechanism of the MusicGen transformer decoder. All transformation matrices in MusicGen are frozen. Only the input embeddings, joint embedding encoders, and the gate factors are trainable.
Figure 3: Comparison of generated samples and ground truth. The top two rows are generated samples, while the bottom rows are reference soundtracks. The text prompt is "lazy jazz composition features a captivating saxophone solo that effortlessly melds with piano chords, skillfully weaving its way through the melody with languid grace. Instruments: saxophone, piano, drums".
Figure 4: The variation of $|g_l|$ during training.
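
The random masking described in the Figure 1 caption can be sketched as below. This is a hedged illustration only: the caption does not say how the masked stream is chosen or what replaces it, so the helper `mask_joint_inputs`, the even split between the two streams, and the use of zeros as the mask value are all assumptions.

```python
import numpy as np


def mask_joint_inputs(acoustic, piano_roll, r, rng):
    """With probability r, hide one of the two condition streams.

    acoustic:   (T, d_a) acoustic embedding frames
    piano_roll: (T, d_p) piano-roll embedding frames
    r:          masking probability used during training
    rng:        a numpy random Generator
    Returns masked copies of both streams and the name of the masked one.
    """
    acoustic = acoustic.copy()
    piano_roll = piano_roll.copy()
    masked = "none"
    if rng.random() < r:
        # Assumption: pick either stream with equal chance and zero it out.
        if rng.random() < 0.5:
            acoustic[:] = 0.0
            masked = "acoustic"
        else:
            piano_roll[:] = 0.0
            masked = "piano_roll"
    return acoustic, piano_roll, masked
```

Training with one stream hidden part of the time would let the model cope at inference when only symbolic or only acoustic conditions are supplied.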