TEAdapter: Supply abundant guidance for controllable text-to-music generation

Jialing Zou; Jiahao Mei; Xudong Nan; Jinghua Li; Daoguo Dong; Liang He

TL;DR

This work tackles the challenge of fine-grained controllability in text-to-music generation by introducing TEAdapter, a lightweight plugin that injects diverse control signals into a frozen diffusion backbone, enabling global, elemental, and structural conditioning. Built on AudioLDM 2 and a multi-TEAdapter framework, the method learns control-specific features $Y_{ec}$ from teacher music and fuses them into the denoising process through a weighted sum $Y = \sum_i w_i Y_{ec}^i$, with the denoising step updated as $\hat{\epsilon}_t = U(z_t, t, Y, Y_{ec}; \theta)$. The training objective $L_{AD}$ reinforces accurate epsilon prediction while permitting scalable, transferable control across chord, melody, instrument timbre, and long-form structure via inpainting of junctions between segments (intro/chorus/outro). Experimental results show improvements in melody accuracy, beat stability, and perceptual quality (FAD, CLAP) relative to strong baselines, and demonstrate effective long-form generation through structural TEAdapter groups. Overall, TEAdapter offers a practical, modular path to more controllable, extended music generation with reduced training costs and broad compatibility with diffusion architectures.
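To make the fusion step concrete, here is a minimal sketch of the weighted sum $Y = \sum_i w_i Y_{ec}^i$ in plain Python. The function name `fuse_adapter_features` and the flat-list representation of features are assumptions made for illustration; the actual adapter features are presumably multi-scale tensors, and this is not the authors' implementation.

```python
from typing import List, Sequence


def fuse_adapter_features(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Return the weighted sum Y = sum_i w_i * Y_ec^i.

    `features` holds one feature vector per adapter (hypothetical flat
    lists standing in for real feature maps); `weights` holds the
    per-adapter weights w_i.
    """
    if not features:
        raise ValueError("at least one adapter feature is required")
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per adapter feature")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all adapter features must have the same length")

    fused = [0.0] * dim
    for weight, feature in zip(weights, features):
        for j, value in enumerate(feature):
            fused[j] += weight * value
    return fused


# Two toy adapter outputs (e.g. chord and melody) mixed with equal weight.
chord_feature = [1.0, 2.0]
melody_feature = [3.0, 4.0]
fused = fuse_adapter_features([chord_feature, melody_feature], [0.5, 0.5])
```

Setting a weight to zero removes that adapter's contribution, which is one simple way to read the idea of switching individual controls on and off.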

Abstract

Although current text-guided music generation technology can cope with simple creative scenarios, achieving fine-grained control over individual text-modality conditions remains challenging as user demands become more intricate. Accordingly, we introduce the TEAcher Adapter (TEAdapter), a compact plugin designed to guide the generation process with diverse control information provided by users. In addition, we explore the controllable generation of extended music by leveraging TEAdapter control groups trained on data of distinct structural functionalities. Overall, we consider control at the global, elemental, and structural levels. Experimental results demonstrate that the proposed TEAdapter enables multiple precise controls and ensures high-quality music generation. Our module is also lightweight and transferable to any diffusion model architecture. Code and demos will be available soon at https://github.com/Ashley1101/TEAdapter.

Paper Structure (16 sections, 3 equations, 4 figures, 2 tables)

This paper contains 16 sections, 3 equations, 4 figures, 2 tables.

Introduction
Related Work
Automated Music Generation
Fine-grained Controllable Generation
Methodology
Preliminary: AudioLDM 2
Method: TEAdapter
Problem Formulation
Elemental Control
Structural Control
Global Control
EXPERIMENT
Experimental Setup
Experimental Results
Conclusion
...and 1 more section

Figures (4)

Figure 1: The overall architecture comprises a frozen music diffusion model and multiple TEAdapter groups. The music diffusion model accepts free text with additional global labels as inputs. Each TEAdapter receives different features extracted from "teacher music", and TEAdapters can be combined into a group, each with a corresponding weight parameter ($\omega$). The switch section determines which TEAdapter group is used. The schematic at the bottom illustrates the inpainting process. The white sliding window represents the input for each inpainting iteration, and the patches indicate the regenerated portions.
Figure 2: Visualization of (1) the original chromagram of a 10-second pop music clip, (2) the corresponding melody extraction result with the filter operation, and (3) the result without the filter operation.
Figure 3: (1) The red dashed lines mark the beat positions estimated by PLP. Generation with chord guidance yields better rhythm stability than generation without it. (2) The melody of results generated with melody guidance is closer to the reference music than that of results generated without it.
Figure 4: Spectrogram visualization of music generated with structural control. Noticeable splicing artifacts can be observed at the 10- and 20-second positions in the first subplot. In contrast, the inpainting result exhibits improved continuity and consistency.
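The sliding-window inpainting described in Figures 1 and 4 can be sketched as a simple scheduling step: for each junction between consecutive segments, choose a context window to feed the model and a patch to regenerate. In the sketch below, the function name `junction_windows`, the 2-second patch length, and the choice to center both window and patch on the junction are assumptions for illustration; the captions only state that a sliding window is used and that the junctions fall at 10 and 20 seconds.

```python
from typing import List, Sequence, Tuple

Span = Tuple[float, float]


def junction_windows(
    segment_lengths: Sequence[float],
    window: float,
    patch: float,
) -> List[Tuple[float, Span, Span]]:
    """Plan one inpainting pass per junction between adjacent segments.

    Returns (junction_time, window_span, patch_span) tuples in seconds.
    The window is centered on the junction and clamped to the clip; the
    patch is the centered sub-span that would be regenerated.
    """
    total = float(sum(segment_lengths))
    if window > total:
        raise ValueError("window must not exceed the total duration")
    if patch > window:
        raise ValueError("patch must fit inside the window")

    plans = []
    junction = 0.0
    for length in segment_lengths[:-1]:  # the final segment has no junction after it
        junction += length
        start = min(max(junction - window / 2, 0.0), total - window)
        window_span = (start, start + window)
        patch_span = (junction - patch / 2, junction + patch / 2)
        plans.append((junction, window_span, patch_span))
    return plans


# Three 10-second segments (e.g. intro, chorus, outro) joined end to end.
plans = junction_windows([10, 10, 10], window=10, patch=2)
```

For three 10-second segments this yields two passes, one around the 10-second junction and one around the 20-second junction, matching the positions where Figure 4 shows splicing artifacts before inpainting.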
