MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models

Yixiao Zhang; Yukara Ikemiya; Gus Xia; Naoki Murata; Marco A. Martínez-Ramírez; Wei-Hsiang Liao; Yuki Mitsufuji; Simon Dixon

TL;DR

A zero-shot approach to editing music produced by text-to-music generation models: it modifies specific attributes, such as genre, mood, and instrument, while keeping other aspects unchanged.

Abstract

Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, music generation usually involves iterative refinement, and how to edit the generated music remains a significant challenge. This paper introduces a novel approach to editing music generated by such models, enabling the modification of specific attributes, such as genre, mood, and instrument, while keeping other aspects unchanged. Our method transforms text editing into latent space manipulation and adds an extra constraint to enforce consistency. It integrates with existing pretrained text-to-music diffusion models without requiring additional training. Experimental results demonstrate superior performance over both zero-shot and certain supervised baselines in style and timbre transfer evaluations. Additionally, we showcase the practical applicability of our approach in real-world music editing scenarios.

Paper Structure (22 sections, 13 equations, 4 figures, 4 tables)

Introduction
Related Work
Text-to-Music Generation
Text-to-Music Editing
Background
Method
Finding Editing Direction
Adding Constraints Over Cross-Attention
Experiments
Baselines
Metrics
Data Preparation
Objective Experiments
Subjective Experiments
Experimental Setup
...and 7 more sections

Figures (4)

Figure 1: Text-to-music editing with MusicMagus. The edit from "piano" to "acoustic guitar" in the text prompt directly alters the corresponding musical attribute, while leaving others unchanged.
Figure 2: The pipeline of finding the editing direction $\Delta$. We first use InstructGPT to generate a large number of captions and then calculate the mean difference between the two embedding sets.
Figure 3: The workflow of the MusicMagus model. To constrain the diffusion model at timestep $t$, we need to: (1) calculate the L2 loss $L_t$ between the cross-attention map $M^\text{edit}_t$ and $M^\text{origin}_t$; (2) compute the gradient of $L_t$ with respect to $z_t$, and then perform a single-step optimization to update $\epsilon_\theta^\text{edit}$ of the diffusion model.
Figure 4: The diagram of the real music audio editing pipeline using MusicMagus with DDIM inversion and diffusion model editing.
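
The Figure 2 caption describes the editing direction $\Delta$ as the mean difference between two sets of caption embeddings. A minimal sketch of that computation is given below, under stated assumptions: each caption has already been encoded and pooled into one fixed-size vector, and random arrays stand in for real text-encoder outputs. The function name `editing_direction`, the toy dimensions, and the way $\Delta$ is added to a prompt embedding are illustrative choices, not the authors' implementation.

```python
import numpy as np


def editing_direction(source_embeddings, target_embeddings):
    """Return mean(target) - mean(source) over two sets of caption embeddings.

    Both inputs are assumed to have shape (num_captions, embedding_dim);
    the two sets may contain different numbers of captions.
    """
    source = np.asarray(source_embeddings, dtype=float)
    target = np.asarray(target_embeddings, dtype=float)
    if source.ndim != 2 or target.ndim != 2:
        raise ValueError("expected 2-D arrays of shape (num_captions, embedding_dim)")
    if source.shape[1] != target.shape[1]:
        raise ValueError("source and target embeddings must share embedding_dim")
    return target.mean(axis=0) - source.mean(axis=0)


# Toy stand-ins for encoded captions (hypothetical data, not real encoder output).
rng = np.random.default_rng(0)
embedding_dim = 8
piano_captions = rng.normal(size=(100, embedding_dim))
guitar_captions = rng.normal(size=(120, embedding_dim)) + 0.5

delta = editing_direction(piano_captions, guitar_captions)

# Assumed usage: shift the original prompt embedding along the direction.
prompt_embedding = rng.normal(size=embedding_dim)
edited_embedding = prompt_embedding + delta
```

Averaging over many captions, rather than subtracting the embeddings of the two words alone, is what the caption's "large number of captions" suggests: context-specific variation tends to cancel, leaving a direction associated with the attribute change.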
