Audio Prompt Adapter: Unleashing Music Editing Abilities for Text-to-Music with Lightweight Finetuning
Fang-Duo Tsai, Shih-Lun Wu, Haven Kim, Bo-Yu Chen, Hao-Chung Cheng, Yi-Hsuan Yang
TL;DR
AP-Adapter tackles the problem of editing existing music with text prompts by introducing a lightweight 22M-parameter module that fuses audio features from AudioMAE into a pre-trained text-to-music diffusion model (AudioLDM2). It leverages decoupled cross-attention adapters to preserve fidelity to the input while enabling targeted edits specified by text, and uses a tunable pooling mechanism to balance fidelity and transferability. The authors demonstrate effectiveness on timbre transfer, genre transfer, and accompaniment generation, including out-of-domain instruments, and show favorable results against baselines in objective metrics and subjective listening tests. This approach offers a practical route to controllable, high-quality music editing with modest computational resources and data requirements, enabling broader accessibility and potential extension to other generative backbones.
Abstract
Text-to-music models allow users to generate nearly realistic musical audio with textual commands. However, editing music audio remains challenging due to the conflicting desiderata of performing fine-grained alterations on the audio while maintaining a simple user interface. To address this challenge, we propose Audio Prompt Adapter (or AP-Adapter), a lightweight addition to pretrained text-to-music models. We utilize AudioMAE to extract features from the input audio, and construct attention-based adapters to feed these features into the internal layers of AudioLDM2, a diffusion-based text-to-music model. With 22M trainable parameters, AP-Adapter empowers users to harness both global (e.g., genre and timbre) and local (e.g., melody) aspects of music, using the original audio and a short text as inputs. Through objective and subjective studies, we evaluate AP-Adapter on three tasks: timbre transfer, genre transfer, and accompaniment generation. Additionally, we demonstrate its effectiveness on out-of-domain audio containing instruments unseen during training.
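
To make the two mechanisms named in the TL;DR concrete, the following is a minimal, dependency-free sketch of (a) pooling a sequence of audio features and (b) decoupled cross-attention, where the text branch and the audio branch are attended to separately and then summed. It is an illustration under stated assumptions, not the authors' implementation: the function names, the identity query/key/value projections, and the `audio_scale` weight are hypothetical, and real AudioMAE features and AudioLDM2 layers are replaced by small Python lists.

```python
import math


def avg_pool(features, window):
    """Average consecutive groups of `window` feature vectors.

    A larger window keeps less detail from the input audio, which is assumed
    here to trade fidelity for freedom to edit. The last group may be shorter.
    """
    pooled = []
    for start in range(0, len(features), window):
        chunk = features[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention with identity projections (assumption:
    a real adapter would apply learned linear projections first)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def decoupled_cross_attention(latents, text_feats, audio_feats, audio_scale=1.0):
    """Attend to text and audio features in separate branches, then sum.

    `audio_scale` is a hypothetical knob weighting the audio branch;
    setting it to 0 reduces the output to the text-only branch.
    """
    text_out = attend(latents, text_feats, text_feats)
    audio_out = attend(latents, audio_feats, audio_feats)
    return [
        [t + audio_scale * a for t, a in zip(t_row, a_row)]
        for t_row, a_row in zip(text_out, audio_out)
    ]


# Usage: pool four toy "audio" frames down to two, then mix them with one text token.
audio = avg_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], window=2)
mixed = decoupled_cross_attention([[0.5, 0.5]], [[1.0, 0.0]], audio, audio_scale=0.5)
```

Keeping the two branches separate means the pretrained text pathway is left untouched in this sketch, and only the audio branch would need new trainable parameters, which is consistent with the lightweight-adapter framing above.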
