Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls

Liwei Lin; Gus Xia; Yixiao Zhang; Junyan Jiang

TL;DR

The paper addresses the lack of long-range, controllable music editing in autoregressive models by introducing AIRGen, a parameter-efficient heterogeneous-adapter approach that turns MusicGen into a masked language model capable of inpainting, arrangement, and refinement. It combines a design with four adapters per layer, a masking training scheme, and frame-level content-based controls to support conditioning on drum tracks, chord progressions, and piano covers, while keeping the majority of the base model frozen. Experiments on Slakh2100 and RWC-POP100 show competitive inpainting quality, strong steerability, and efficient fine-tuning with lightweight adapters, including favorable performance on long gaps and robustness to different masking patterns. The work makes steerable long-term music editing more practical at reduced computational cost and lays groundwork for richer content-based control in AI-driven music tools.

Abstract

Controllable music generation plays a vital role in human-AI music co-creation. While Large Language Models (LLMs) have shown promise in generating high-quality music, their focus on autoregressive generation limits their utility in music editing tasks. To address this gap, we propose a novel approach leveraging a parameter-efficient heterogeneous adapter combined with a masking training scheme. This approach enables autoregressive language models to seamlessly address music inpainting tasks. Additionally, our method integrates frame-level content-based controls, facilitating track-conditioned music refinement and score-conditioned music arrangement. We apply this method to fine-tune MusicGen, a leading autoregressive music generation model. Our experiments demonstrate promising results across multiple music editing tasks, offering more flexible controls for future AI-driven music editing tools. The source code and a demo page showcasing our work are available at https://kikyo-16.github.io/AIR.

Paper Structure (24 sections, 7 equations, 7 figures, 4 tables, 1 algorithm)

Introduction
Related Work
General Audio and Music Generation Models
Music Inpainting
Parameter-Efficient Fine-tuning
Methodology
MusicGen
Content-Based Controls
Tokenization Design
Heterogeneous Adapter
Experiments
Dataset
Training
Evaluation
Baselines
...and 9 more sections

Figures (7)

Figure 1: Different music editing tasks accomplished by the model. Light green blocks with pen icons denote the masked parts to generate. (a) To inpaint an entire segment; (b) To refine some tracks conditioned on other tracks (e.g., drums); (c) To arrange the target segment following user-provided piano cover or chord controls.
Figure 2: Design of the input sequence, with $T=5$ as an example. The prefix area is followed by the prediction area. The prefix area spans from $1$ to $T$. Here the unmasked locations contain the infilling contexts. At the masked locations, the ground-truth audio is masked away and replaced by the frame-level condition audio track. The prediction area spans from $T+1$ to $2 \times T$. In this example, the masks happen to be contiguous, while in general an arbitrary set of frames can be masked.
Figure 3: The attention mask in heterogeneous adapters. The top-right block in the attention mask matrix indicates the cross-attention between learnable adapters and input tokens, where a specific group of adapters associates with a specific type of input tokens. The bottom-right block of the attention mask matrix is a general causal mask, regardless of the distribution of the masked tokens in the prefix/prediction area. In this example, the masked tokens are contiguous; however, in general an arbitrary set of frames can be masked.
Figure 4: The three types of masks used in evaluation. The number of masks is by design and the location is random.
Figure 5: Inpainting with various controls. The spectrogram conceptually repeats the same music segment twice, with the left half ($1,...,T$) providing the model with the infilling context and the controls, and the right half ($T+1,...,2T$) for the model to predict. Specifically, in the left half, outside the blue rectangle is the unmasked infilling context, and inside the blue rectangle is the separated/synthesized/user-specified audio representation of the condition (e.g., (a) drum, (b) block chords, (c) piano cover). In the right half, the orange rectangle highlights the inpainted results.
...and 2 more figures
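
The input layout described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes frames are numbered $1,\dots,T$, that the ground-truth audio and the condition track are already available as equal-length lists of per-frame items, and that the masked set is an arbitrary subset of the prefix frames. The function names `build_prefix` and `position_roles` and the role labels are hypothetical, chosen only for illustration; the sketch does not cover tokenization or what the model is trained to emit in the prediction area.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")

# Hypothetical role labels, used only in this illustration.
CONTEXT, CONDITION, PREDICT = "context", "condition", "predict"


def _check_masked(T: int, masked: Sequence[int]) -> set:
    """Validate that every masked frame index lies in the prefix area 1..T."""
    masked_set = set(masked)
    if any(not 1 <= t <= T for t in masked_set):
        raise ValueError("masked frame indices must lie in 1..T")
    return masked_set


def build_prefix(audio: Sequence[Frame],
                 condition: Sequence[Frame],
                 masked: Sequence[int]) -> List[Frame]:
    """Build the prefix area (frames 1..T).

    Unmasked frames keep the ground-truth audio (the infilling context);
    masked frames are replaced by the frame-level condition track.
    """
    if len(audio) != len(condition):
        raise ValueError("audio and condition must have the same length")
    T = len(audio)
    masked_set = _check_masked(T, masked)
    return [condition[t - 1] if t in masked_set else audio[t - 1]
            for t in range(1, T + 1)]


def position_roles(T: int, masked: Sequence[int]) -> List[str]:
    """Label all 2*T positions: prefix area (1..T), then prediction area (T+1..2T)."""
    masked_set = _check_masked(T, masked)
    prefix = [CONDITION if t in masked_set else CONTEXT
              for t in range(1, T + 1)]
    return prefix + [PREDICT] * T


# Usage with T = 5 and a contiguous mask over frames 2..4.
audio = [f"a{t}" for t in range(1, 6)]      # stand-ins for audio frames
condition = [f"c{t}" for t in range(1, 6)]  # stand-ins for condition frames
print(build_prefix(audio, condition, [2, 3, 4]))  # ['a1', 'c2', 'c3', 'c4', 'a5']
print(position_roles(5, [2, 3, 4]))
```

The mask need not be contiguous: passing `[1, 5]` instead of `[2, 3, 4]` labels the first and last prefix frames as condition frames and leaves the middle three as context.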
