MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers

Ali Boudaghi, Hadi Zare

TL;DR

MusRec introduces a zero-shot text-to-music editing framework built on rectified flow and diffusion transformers to edit real-world audio without retraining. By inverting audio into a rectified-flow latent space with RF-Solver and injecting inversion-derived attention features during denoising, MusRec achieves timbre and genre edits while preserving structure and content. The approach uses a VAE-based audio encoder, KV/V attention-injection strategies, and classifier-free guidance to balance semantic alignment with musical fidelity. It is evaluated on small timbre and genre datasets with both objective and subjective metrics. Results show that KV-injection offers the best overall trade-off between transferability and fidelity, demonstrating practical zero-shot editing capabilities for real recordings. This work establishes rectified-flow-based editing as a viable foundation for flexible, high-quality, real-audio music transformation without task-specific training.
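
The invert-then-denoise loop with classifier-free guidance summarized above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: `toy_velocity` is a hypothetical stand-in for the diffusion transformer, prompts are represented as points in a three-dimensional latent space, time runs from t=0 (clean latent) to t=1 (noise), and plain Euler steps replace RF-Solver.

```python
from typing import Callable, List

Vector = List[float]


def toy_velocity(x: Vector, t: float, prompt: Vector) -> Vector:
    """Hypothetical velocity field: integrating backward in time pulls x toward `prompt`."""
    return [xi - pi for xi, pi in zip(x, prompt)]


def guided_velocity(x: Vector, t: float, cond: Vector, null: Vector, scale: float) -> Vector:
    """Classifier-free guidance: v_null + scale * (v_cond - v_null)."""
    v_cond = toy_velocity(x, t, cond)
    v_null = toy_velocity(x, t, null)
    return [u + scale * (c - u) for c, u in zip(v_cond, v_null)]


def invert(x0: Vector, velocity: Callable[[Vector, float], Vector], steps: int) -> Vector:
    """Euler-integrate the flow ODE forward from t=0 (data) to t=1 (noise)."""
    dt = 1.0 / steps
    x = list(x0)
    for i in range(steps):
        v = velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def denoise(x1: Vector, velocity: Callable[[Vector, float], Vector], steps: int) -> Vector:
    """Euler-integrate the flow ODE backward from t=1 (noise) to t=0 (data)."""
    dt = 1.0 / steps
    x = list(x1)
    for i in range(steps, 0, -1):
        v = velocity(x, i * dt)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


# Illustrative values only; none of these numbers come from the paper.
source = [0.8, -0.3, 0.5]          # latent of the recording to edit
source_prompt = [0.2, 0.0, 0.1]    # prompt describing the source
target_prompt = [1.0, 1.0, -1.0]   # prompt describing the desired edit
null_prompt = [0.0, 0.0, 0.0]      # empty prompt used by guidance
steps = 100

# Invert under the source prompt, then denoise under either prompt.
inverted = invert(source, lambda x, t: toy_velocity(x, t, source_prompt), steps)
reconstructed = denoise(inverted, lambda x, t: toy_velocity(x, t, source_prompt), steps)
edited = denoise(
    inverted,
    lambda x, t: guided_velocity(x, t, target_prompt, null_prompt, 2.0),
    steps,
)
```

Denoising with the same prompt used for inversion approximately recovers the source, with an error that shrinks as the step count grows. Swapping in the guided target velocity moves the result toward the target prompt instead. In this sketch the guidance scale controls how far the edit is pushed, which mirrors the alignment-versus-fidelity balance the summary attributes to classifier-free guidance.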

Abstract

Music editing has emerged as an important and practical area of artificial intelligence, with applications ranging from video game and film music production to personalizing existing tracks according to user preferences. However, existing models face significant limitations, such as being restricted to editing synthesized music generated by their own models, requiring highly precise prompts, or necessitating task-specific retraining, thus lacking true zero-shot capability. Leveraging recent advances in rectified flow and diffusion transformers, we introduce MusRec, a zero-shot text-to-music editing model capable of performing diverse editing tasks on real-world music efficiently and effectively. Experimental results demonstrate that our approach outperforms existing methods in preserving musical content, structural consistency, and editing fidelity, establishing a strong foundation for controllable music editing in real-world scenarios.

Paper Structure

This paper contains 28 sections, 12 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The source audio is first inverted into noise and then denoised to generate the edited audio. During denoising, the self-attention operations within the single blocks are modified according to their corresponding inversion steps. Note that the architecture comprises multiple single and double blocks, although only one of each is illustrated for clarity.
  • Figure 2: Transferability--fidelity trade-off effects of injection steps and IB (injection block) count on the timbre transfer task. The diagram shows the results of injecting the value ($V$) components of the attention mechanism into generation, i.e., how $V$-injection affects fidelity and transferability of the edited audio. For results of injecting the key ($K$) components or both key and value ($K+V$), and for all related results of genre transfer, see the Appendix.
  • Figure 3: Results of injecting the key ($K$) components of the attention mechanism during the timbre transfer task. Injecting $K$ leads to moderate improvements in transferability but slightly weaker fidelity compared to $V$-injection, as less low-level acoustic information is preserved.
  • Figure 4: Results of injecting both key and value ($K+V$) components of the attention mechanism during the timbre transfer task. Injecting $K+V$ tends to balance fidelity and transferability, yielding more consistent timbre adaptation while retaining semantic control.
  • Figure 5: Results of injecting the value ($V$) components of the attention mechanism during the genre transfer task. Injecting $V$ mainly preserves fidelity while limiting the degree of stylistic transfer, showing more stable tonal similarity across genres.
  • ...and 2 more figures
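
The attention-feature injection described in the figure captions above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: attention is single-head, the query, key, and value matrices are passed in directly rather than computed by learned projections, and `FeatureCache` and `attend_with_injection` are hypothetical names. Features are keyed by (step, block) because the Figure 1 caption says each denoising step is modified according to its corresponding inversion step.

```python
import math
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head scaled dot-product attention on plain Python lists."""
    dim = len(q[0])
    out = []
    for q_row in q:
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(dim)
            for k_row in k
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v_row[c] for w, v_row in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out


class FeatureCache:
    """Hypothetical store for K and V recorded during inversion."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, int], Tuple[Matrix, Matrix]] = {}

    def save(self, step: int, block: int, k: Matrix, v: Matrix) -> None:
        self._store[(step, block)] = (k, v)

    def load(self, step: int, block: int) -> Optional[Tuple[Matrix, Matrix]]:
        return self._store.get((step, block))


def attend_with_injection(q: Matrix, k: Matrix, v: Matrix, cache: FeatureCache,
                          step: int, block: int, mode: str) -> Matrix:
    """Replace K, V, or both with cached inversion features when available."""
    if mode not in ("K", "V", "KV"):
        raise ValueError("mode must be 'K', 'V', or 'KV'")
    cached = cache.load(step, block)
    if cached is not None:
        k_src, v_src = cached
        if mode in ("K", "KV"):
            k = k_src
        if mode in ("V", "KV"):
            v = v_src
    return attention(q, k, v)


# Illustrative values only: features saved at inversion step 0, block 0.
cache = FeatureCache()
k_src, v_src = [[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.1, 0.9]]
cache.save(0, 0, k_src, v_src)

q = [[1.0, 0.0], [0.0, 1.0]]
k_new, v_new = [[0.3, 0.7], [0.6, 0.4]], [[1.0, 0.0], [0.0, 1.0]]
out_kv = attend_with_injection(q, k_new, v_new, cache, step=0, block=0, mode="KV")
out_v = attend_with_injection(q, k_new, v_new, cache, step=0, block=0, mode="V")
out_plain = attend_with_injection(q, k_new, v_new, cache, step=5, block=0, mode="KV")
```

In this sketch, steps or blocks with no cached entry fall back to ordinary attention. Limiting which steps and blocks are cached therefore plays the role of the injection-step and injection-block counts whose trade-off the figures examine.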