MusicWeaver: Composer-Style Structural Editing and Minute-Scale Coherent Music Generation
Xuanchen Wang, Heng Wang, Weidong Cai
TL;DR
MusicWeaver tackles the challenge of long-form music generation with two core goals: composer-style structural editing and minute-scale coherence. It introduces a two-stage framework that predicts a human-interpretable plan $\mathcal{P}=(\mathcal{B},\mathcal{G},\mathcal{A})$ and renders audio conditioned on this plan via a Global-Local Diffusion Transformer (GL-DiT). Key innovations include Motif Memory Retrieval (MMR) for consistent motif recurrence and Projected Diffusion Inpainting (PDI) for drift-free localized edits, along with Structure Coherence Score (SCS) and Edit Fidelity Score (EFS) to quantify long-range form and edit realization. Experiments on Text-to-Music and Video-to-Music tasks show state-of-the-art fidelity, controllability, and coherent long-range structure, validating the effectiveness of plan-based conditioning and drift-free editing for practical music creation.
Abstract
Recent advances in music generation produce impressive samples; however, practical creation still lacks two key capabilities: composer-style structural editing and minute-scale coherence. We present MusicWeaver, a framework for generating and editing long-range music using a human-interpretable intermediate representation with guaranteed edit locality. MusicWeaver decomposes generation into two stages: it first predicts a structured plan, a multi-level song program encoding musical attributes that composers can directly edit, and then renders audio conditioned on this plan. To ensure minute-scale coherence, we introduce a Global-Local Diffusion Transformer, where a global path captures long-range musical progression via compressed representations and memory, while a local path synthesizes fine-grained acoustic detail. We further propose a Motif Memory Retrieval module that enables consistent motif recurrence with controllable variation. For editing, we propose Projected Diffusion Inpainting, an inpainting method that denoises only user-specified regions and preserves unchanged content, allowing repeated edits without drift. Finally, we introduce Structure Coherence Score and Edit Fidelity Score to evaluate long-range form and edit realization. Experiments demonstrate that MusicWeaver achieves state-of-the-art fidelity, controllability, and long-range coherence.
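The edit-locality idea behind Projected Diffusion Inpainting can be illustrated with a generic mask-projected denoising loop. The abstract does not specify the actual PDI algorithm, so the following is only a minimal sketch under stated assumptions: the "projection" is taken to be a hard overwrite of the unedited region after every step, and `projected_inpaint`, `toy_denoiser`, and all parameters are hypothetical names introduced for illustration, not the authors' implementation.

```python
import numpy as np

def projected_inpaint(x_orig, edit_mask, denoise_step, num_steps=10, seed=0):
    """Regenerate only the masked region; all other positions are reset to x_orig.

    Assumption: projection = exact overwrite of the unedited region per step.
    """
    rng = np.random.default_rng(seed)
    keep = ~edit_mask
    # Start from noise inside the edit region and original content elsewhere.
    x = np.where(edit_mask, rng.standard_normal(x_orig.shape), x_orig)
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)      # hypothetical denoiser; may alter every position
        x[keep] = x_orig[keep]      # projection: restore untouched content exactly
    return x

def toy_denoiser(x, t):
    # Stand-in for a trained model (assumption): shrinks all values toward zero.
    return 0.5 * x

# Two successive edits on a toy 1-D "latent" sequence.
x0 = np.arange(8, dtype=float)
mask_a = np.zeros(8, dtype=bool)
mask_a[2:5] = True
mask_b = np.zeros(8, dtype=bool)
mask_b[6:8] = True

x1 = projected_inpaint(x0, mask_a, toy_denoiser)
x2 = projected_inpaint(x1, mask_b, toy_denoiser, seed=1)
```

Because the unedited region is copied back verbatim at every step, positions outside both masks in `x2` are bit-identical to `x0`, which is the sense in which repeated edits do not drift in this toy setting. A real system would operate on audio latents and would likely treat the boundary between edited and preserved regions more carefully.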
