Audio Prompt Adapter: Unleashing Music Editing Abilities for Text-to-Music with Lightweight Finetuning

Fang-Duo Tsai; Shih-Lun Wu; Haven Kim; Bo-Yu Chen; Hao-Chung Cheng; Yi-Hsuan Yang

TL;DR

AP-Adapter tackles the problem of editing existing music with text prompts by introducing a lightweight 22M-parameter module that fuses audio features from AudioMAE into a pre-trained text-to-music diffusion model (AudioLDM2). It leverages decoupled cross-attention adapters to preserve fidelity to the input while enabling targeted edits specified by text, and uses a tunable pooling mechanism to balance fidelity and transferability. The authors demonstrate effectiveness on timbre transfer, genre transfer, and accompaniment generation, including out-of-domain instruments, and show favorable results against baselines in objective metrics and subjective listening tests. This approach offers a practical route to controllable, high-quality music editing with modest computational resources and data requirements, enabling broader accessibility and potential extension to other generative backbones.

Abstract

Text-to-music models allow users to generate nearly realistic musical audio with textual commands. However, editing music audio remains challenging due to the conflicting desiderata of performing fine-grained alterations on the audio while maintaining a simple user interface. To address this challenge, we propose Audio Prompt Adapter (or AP-Adapter), a lightweight addition to pretrained text-to-music models. We utilize AudioMAE to extract features from the input audio, and construct attention-based adapters to feed these features into the internal layers of AudioLDM2, a diffusion-based text-to-music model. With 22M trainable parameters, AP-Adapter empowers users to harness both global (e.g., genre and timbre) and local (e.g., melody) aspects of music, using the original audio and a short text as inputs. Through objective and subjective studies, we evaluate AP-Adapter on three tasks: timbre transfer, genre transfer, and accompaniment generation. Additionally, we demonstrate its effectiveness on out-of-domain audio containing instruments unseen during training.
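To make the adapter idea concrete, here is a minimal sketch of decoupled cross-attention preceded by feature pooling, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the learned query/key/value projections are omitted, the pooling is a simple non-overlapping average, and the function names (`avg_pool`, `attention`, `decoupled_cross_attention`) are hypothetical. The symbols `omega` (read here as a pooling window) and `alpha` (read here as the weight on the audio branch) are borrowed from the Figure 2 caption, which does not define them on this page, so that mapping is also an assumption.

```python
import math

def avg_pool(frames, omega):
    """Average non-overlapping windows of `omega` consecutive feature frames.
    Assumption: larger omega keeps less temporal detail from the input audio."""
    pooled = []
    for start in range(0, len(frames), omega):
        window = frames[start:start + omega]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs

def decoupled_cross_attention(queries, text_feats, audio_feats, alpha):
    """Run separate text and audio attention branches and sum them,
    scaling the audio branch by alpha (projections omitted for brevity)."""
    text_out = attention(queries, text_feats, text_feats)
    audio_out = attention(queries, audio_feats, audio_feats)
    return [[t + alpha * a for t, a in zip(t_row, a_row)]
            for t_row, a_row in zip(text_out, audio_out)]

# Toy inputs: two latent queries, two text tokens, four audio frames.
latent_queries = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 1.0], [0.5, -0.5]]
audio_frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]

audio_feats = avg_pool(audio_frames, omega=2)  # [[2.0, 0.0], [0.0, 3.0]]
edited = decoupled_cross_attention(latent_queries, text_feats, audio_feats, alpha=0.55)
```

In this sketch, setting `alpha` to 0 reduces the layer to text-only cross-attention, which is one way to see how an audio branch can be added without disturbing the text pathway of the pretrained model.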

Paper Structure (24 sections, 11 equations, 3 figures, 2 tables)

Introduction
Related Work
Background
Diffusion Model
AudioLDM2
Classifier-free Guidance
Proposed Audio Prompt Adapter
Audio Encoder and Feature Pooling
Decoupled Cross-attention Adapters
Training
Inference
Experiment Setup
Dataset Preparation
Evaluation Tasks
Training and Inference Specifics
...and 9 more sections

Figures (3)

Figure 1: Our AP-Adapter is an add-on to AudioLDM2 [liu2023audioldm2]. Users provide the original audio to AudioMAE [huang2022masked], which extracts audio features, and an editing command to the text encoder. The decoupled audio and text cross-attention layers of AP-Adapter contribute, respectively, to fidelity to the input audio and transferability of the editing command in the edited audio.
Figure 2: Transferability-fidelity tradeoff effects of different hyperparameters on the timbre transfer task. The hyperparameters are set to $\omega = 2$, $\alpha = 0.55$, and $\lambda = 7.5$ when they are not the hyperparameter of interest.
