TEAdapter: Supply abundant guidance for controllable text-to-music generation
Jialing Zou, Jiahao Mei, Xudong Nan, Jinghua Li, Daoguo Dong, Liang He
TL;DR
This work tackles the challenge of fine-grained controllability in text-to-music generation by introducing TEAdapter, a lightweight plugin that injects diverse control signals into a frozen diffusion backbone, enabling global, elemental, and structural conditioning. Built on AudioLDM 2 and a multi-TEAdapter framework, the method learns control-specific features $Y_{ec}$ from teacher music and fuses them into the denoising process through a weighted sum $Y = \sum_i w_i Y_{ec}^i$, with the denoising step updated as $\hat{\epsilon}_t = U(z_t, t, Y, Y_{ec}; \theta)$. The training objective $L_{AD}$ reinforces accurate epsilon prediction while permitting scalable, transferable control across chord, melody, instrument timbre, and long-form structure via inpainting of junctions between segments (intro/chorus/outro). Experimental results show improvements in melody accuracy, beat stability, and perceptual quality (FAD, CLAP) relative to strong baselines, and demonstrate effective long-form generation through structural TEAdapter groups. Overall, TEAdapter offers a practical, modular path to more controllable, extended music generation with reduced training costs and broad compatibility with diffusion architectures.
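The weighted fusion $Y = \sum_i w_i Y_{ec}^i$ mentioned in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fuse_adapter_features`, the feature shapes, and the example weights are all hypothetical, and plain NumPy arrays stand in for whatever feature maps the adapters actually produce.

```python
import numpy as np


def fuse_adapter_features(features, weights):
    """Return the weighted sum Y = sum_i w_i * Y_ec^i.

    `features` is a list of same-shaped arrays (one per adapter, an
    assumption of this sketch) and `weights` holds one scalar per adapter.
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per adapter feature")
    # Stack to shape (n_adapters, ...) so the sum runs over the first axis.
    stacked = np.stack([np.asarray(f, dtype=float) for f in features], axis=0)
    w = np.asarray(weights, dtype=float)
    # Contract the adapter axis: result has the shape of a single feature.
    return np.tensordot(w, stacked, axes=1)


# Hypothetical example: two adapters (say, chord and melody) with toy features.
chord_feat = np.ones((2, 3))
melody_feat = 2.0 * np.ones((2, 3))
fused = fuse_adapter_features([chord_feat, melody_feat], [0.5, 0.25])
```

In this toy case every entry of `fused` is 0.5 * 1 + 0.25 * 2. Raising one weight relative to the others is how a sketch like this would let one control signal dominate the combined guidance.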
Abstract
Although current text-guided music generation technology can cope with simple creative scenarios, achieving fine-grained control over individual text-modality conditions remains challenging as user demands become more intricate. Accordingly, we introduce the TEAcher Adapter (TEAdapter), a compact plugin designed to guide the generation process with diverse control information provided by users. In addition, we explore the controllable generation of extended music by leveraging TEAdapter control groups trained on data with distinct structural functions. Overall, we consider control at the global, elemental, and structural levels. Experimental results demonstrate that the proposed TEAdapter enables multiple precise controls and ensures high-quality music generation. Our module is also lightweight and transferable to any diffusion model architecture. Code and demos will be made available soon at https://github.com/Ashley1101/TEAdapter.
