High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching

Gael Le Lan, Bowen Shi, Zhaoheng Ni, Sidd Srinivasan, Anurag Kumar, Brian Ellis, David Kant, Varun Nagaraja, Ernie Chang, Wei-Ning Hsu, Yangyang Shi, Vikas Chandra

TL;DR

MelodyFlow, an efficient text-controllable high-fidelity music generation and editing model based on a diffusion transformer architecture trained on a flow-matching objective, outperforms both ReNoise and DDIM for zero-shot test-time text-guided editing on several objective metrics.

Abstract

We introduce MelodyFlow, an efficient text-controllable high-fidelity music generation and editing model. It operates on continuous latent representations from a low-frame-rate 48 kHz stereo variational autoencoder codec. Based on a diffusion transformer architecture trained on a flow-matching objective, the model can edit diverse high-quality stereo samples of variable duration using simple text descriptions. We adapt the ReNoise latent inversion method to flow matching and compare it with the original implementation and naive denoising diffusion implicit model (DDIM) inversion on a variety of music editing prompts. Our results indicate that our latent inversion outperforms both ReNoise and DDIM for zero-shot test-time text-guided editing on several objective metrics. Subjective evaluations show a substantial improvement over the previous state of the art for music editing. Code and model weights will be made publicly available. Samples are available at https://melodyflow.github.io.
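
To make the flow-matching objective mentioned in the abstract concrete, here is a minimal NumPy sketch of one training example on a straight-line path between a latent and Gaussian noise. It is an illustration under stated assumptions, not the authors' training code: the time convention (t = 0 is data, t = 1 is noise), the linear path, and the names `flow_matching_pair` and `flow_matching_loss` are all hypothetical.

```python
import numpy as np


def flow_matching_pair(x_data, rng):
    """Build one (z_t, t, target velocity) triple on a linear path.

    Assumed convention (not confirmed by the source): t=0 is the data
    latent and t=1 is pure Gaussian noise.
    """
    noise = rng.standard_normal(x_data.shape)
    t = rng.uniform()  # flow step sampled in [0, 1)
    z_t = (1.0 - t) * x_data + t * noise  # point on the straight path
    target_v = noise - x_data  # constant velocity from data to noise
    return z_t, t, target_v


def flow_matching_loss(predict_velocity, x_data, rng):
    """Mean squared error between predicted and target velocity."""
    z_t, t, target_v = flow_matching_pair(x_data, rng)
    return float(np.mean((predict_velocity(z_t, t) - target_v) ** 2))
```

In a real model, `predict_velocity` would be the text-conditioned diffusion transformer; here it is any callable taking `(z_t, t)`.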

Paper Structure

This paper contains 44 sections, 4 equations, 8 figures, 5 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Overview of the MelodyFlow editing process. A waveform is encoded into $\mathbf{x}_{src}$ before being fed to the ODE solver. Step-by-step, the DiT predicts the velocity $\delta$ from data to noise, while being regularized against the prediction of an artificially constructed $\mathbf{\tilde{z}}_t$ so as to enhance editability. Once the target inversion flow step $T_{edit}$ has been reached, the model is used in the classic generation setting (bottom of the Figure, from right to left), except that the starting latent $\mathbf{z}_{t_{edit}}$ has been estimated so as to achieve better editability and consistency with the source waveform.
  • Figure 2: Effect of the regularization weight $\lambda_{KL}$ on the quality (FAD) and text adherence (CLAP) of music editing. $\epsilon$- and $v$-prediction are compared with or without $c_{orig}$.
  • Figure 3: Music editing quality as a function of the target inversion step $T_{edit}$. We report the FAD$_{edit}$, CLAP$_{edit}$ and LPAPS objective metrics.
  • Figure 4: Efficiency-quality trade-offs of MelodyFlow in the text-guided music editing setting, measured using objective metrics. The objective metrics (FAD$_{edit}$, CLAP$_{edit}$ and LPAPS) indicate a sweet spot around 128 NFE.
  • Figure 5: Music editing subjective evaluation form. Given the original song A, raters are asked to evaluate three different edits of A, on the three following axes: quality, text adherence, consistency.
  • ...and 3 more figures
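
The two phases described in the Figure 1 caption (invert the source latent up to the target step, then generate back with the edit prompt) can be sketched with a plain Euler ODE solver. This is a simplified illustration only: it omits the regularization against $\mathbf{\tilde{z}}_t$ that the caption describes, assumes t = 0 is data and larger t is closer to noise, and uses hypothetical names (`euler_integrate`, `edit_latent`, `src_cond`, `edit_cond`) that do not come from the paper.

```python
import numpy as np


def euler_integrate(velocity, z, t_start, t_end, steps, cond):
    """Integrate dz/dt = velocity(z, t, cond) with fixed-size Euler steps."""
    ts = np.linspace(t_start, t_end, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        # The step (t1 - t0) is negative when integrating back toward data.
        z = z + (t1 - t0) * velocity(z, t0, cond)
    return z


def edit_latent(velocity, x_src, t_edit, steps, src_cond, edit_cond):
    """Invert x_src up to t_edit, then regenerate under the edit condition."""
    # Phase 1: inversion, data -> intermediate latent at flow step t_edit.
    z_edit = euler_integrate(velocity, x_src, 0.0, t_edit, steps, src_cond)
    # Phase 2: generation, from t_edit back to data with the new condition.
    return euler_integrate(velocity, z_edit, t_edit, 0.0, steps, edit_cond)
```

With a toy velocity field that depends only on the condition, inverting and regenerating under the same condition recovers the source latent, while a different condition shifts it. In the setting the caption describes, `velocity` would be the DiT prediction and the result would be decoded back to a waveform by the codec.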