MusicWeaver: Composer-Style Structural Editing and Minute-Scale Coherent Music Generation

Xuanchen Wang, Heng Wang, Weidong Cai

TL;DR

MusicWeaver tackles the challenge of long-form music generation with two core goals: composer-style structural editing and minute-scale coherence. It introduces a two-stage framework that predicts a human-interpretable plan $\mathcal{P}=(\mathcal{B},\mathcal{G},\mathcal{A})$ and renders audio conditioned on this plan via a Global-Local Diffusion Transformer (GL-DiT). Key innovations include Motif Memory Retrieval (MMR) for consistent motif recurrence and Projected Diffusion Inpainting (PDI) for drift-free localized edits, along with Structure Coherence Score (SCS) and Edit Fidelity Score (EFS) to quantify long-range form and edit realization. Experiments on Text-to-Music and Video-to-Music tasks show state-of-the-art fidelity, controllability, and coherent long-range structure, validating the effectiveness of plan-based conditioning and drift-free editing for practical music creation.

Abstract

Recent advances in music generation produce impressive samples; however, practical creation still lacks two key capabilities: composer-style structural editing and minute-scale coherence. We present MusicWeaver, a framework for generating and editing long-range music using a human-interpretable intermediate representation with guaranteed edit locality. MusicWeaver decomposes generation into two stages: it first predicts a structured plan, a multi-level song program encoding musical attributes that composers can directly edit, and then renders audio conditioned on this plan. To ensure minute-scale coherence, we introduce a Global-Local Diffusion Transformer, where a global path captures long-range musical progression via compressed representations and memory, while a local path synthesizes fine-grained acoustic detail. We further propose a Motif Memory Retrieval module that enables consistent motif recurrence with controllable variation. For editing, we propose Projected Diffusion Inpainting, an inpainting method that denoises only user-specified regions and preserves unchanged content, allowing repeated edits without drift. Finally, we introduce Structure Coherence Score and Edit Fidelity Score to evaluate long-range form and edit realization. Experiments demonstrate that MusicWeaver achieves state-of-the-art fidelity, controllability, and long-range coherence.
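
The edit-locality claim can be illustrated with a toy masked projection. The sketch below is not the paper's Projected Diffusion Inpainting: the names `project`, `toy_denoise_step`, and `edit_region` are hypothetical, the "denoiser" is a stand-in that only adds noise, and latents are plain Python lists. It shows one general idea only: if every update is followed by a projection that restores the reference values outside the edit mask, unedited content stays bit-identical however many edits are chained.

```python
import random

def project(candidate, reference, edit_mask):
    """Keep candidate values inside the edit mask and reference values outside it."""
    return [c if m else r for c, r, m in zip(candidate, reference, edit_mask)]

def toy_denoise_step(latent, rng):
    """Hypothetical stand-in for one denoiser update; not the paper's GL-DiT."""
    return [0.9 * v + 0.1 * rng.gauss(0.0, 1.0) for v in latent]

def edit_region(reference, edit_mask, steps=10, seed=0):
    """Regenerate only the masked positions of `reference` (illustrative only)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in reference]
    latent = project(noise, reference, edit_mask)        # noise only inside the mask
    for _ in range(steps):
        latent = toy_denoise_step(latent, rng)           # update touches every position
        latent = project(latent, reference, edit_mask)   # projection restores the rest
    return latent

reference = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75]
first = edit_region(reference, [False, False, True, True, False, False], seed=1)
second = edit_region(first, [False, False, False, False, True, False], seed=2)

# Positions outside both masks are identical to the original after two edits.
assert [second[i] for i in (0, 1, 5)] == [reference[i] for i in (0, 1, 5)]
# The second edit leaves the result of the first edit untouched.
assert second[2:4] == first[2:4]
```

This sketch projects clean reference values directly, which is a simplification. Diffusion inpainting schemes typically match the known region to the current noise level before combining it with the denoised region, and the abstract does not specify how the paper handles this step.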

Paper Structure

This paper contains 19 sections, 14 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Example structured plan. Top: the segment-level song program with section types and motif ids. Middle: beat-grid information, including meter and tempo. Bottom: bar-aligned editable attributes, including energy, density, groove, harmony, and variation strength.
  • Figure 2: Overview of MusicWeaver. The model supports multiple prompt modalities (text, video, audio, or their fusion). Each modality is encoded by a pretrained encoder and projected into a shared embedding space. The resulting embeddings are fused using a confidence-weighted attention module. A plan generator first predicts a structured plan, which then conditions a diffusion renderer to synthesize the final music.
  • Figure 3: Architecture of MusicWeaver. MusicWeaver follows a two-stage pipeline. In the first stage, given a prompt, we obtain a conditioning embedding $E$ and predict a structured plan $\mathcal{P}$. Plan generation is hierarchical. We first predict the beat grid $\mathcal{B}$ using a Beat Grid Head. Conditioned on $(E,\mathcal{B})$, a program decoder then autoregressively generates the segment-level song program $\mathcal{G}$. Finally, conditioned on $(E,\mathcal{B},\mathcal{G})$, we predict the bar-level attributes $\mathcal{A}$. In the second stage, we synthesize music conditioned on the predicted plan $\mathcal{P}$. We employ a diffusion model whose denoiser is implemented as a Global-Local Diffusion Transformer (GL-DiT), which couples a global long-context pathway with a local high-resolution denoising pathway. At each diffusion step $\tau$, GL-DiT takes as input the noisy VAE latent $z_\tau$, the timestep embedding $e_\tau$, time-aligned controls $\mathcal{C}$, and plan tokens $T_{\mathrm{plan}}$, and outputs a noise prediction.
  • Figure 4: User study results of generated music. The values represent the average OVL and REL scores across Text-to-Music (on MusicCaps), Video-to-Music (on V2M-bench)
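
The three-level plan described in the captions for Figures 1 and 3 (beat grid $\mathcal{B}$, song program $\mathcal{G}$, bar-level attributes $\mathcal{A}$) can be pictured as a small nested data structure. The sketch below is an assumption for illustration: the class names, field names, field types, and the `set_energy` and `changed_bars` helpers are hypothetical and are not taken from the paper, which may encode the plan quite differently (for example as token sequences). It shows how a bar-level edit to such a plan yields an explicit list of affected bars.

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class BeatGrid:            # B: meter and tempo (representation assumed)
    meter: Tuple[int, int]
    tempo_bpm: float

@dataclass(frozen=True)
class Segment:             # one entry of G: section type and motif id
    section: str
    motif_id: int
    start_bar: int
    num_bars: int

@dataclass(frozen=True)
class BarAttributes:       # one entry of A: the editable attributes named in Figure 1
    energy: float
    density: float
    groove: str
    harmony: str
    variation: float

@dataclass(frozen=True)
class Plan:                # P = (B, G, A)
    beat_grid: BeatGrid
    program: Tuple[Segment, ...]
    attributes: Tuple[BarAttributes, ...]

def set_energy(plan, first_bar, last_bar, energy):
    """Return a new plan with energy overwritten on bars first_bar..last_bar (inclusive)."""
    bars = tuple(
        replace(a, energy=energy) if first_bar <= i <= last_bar else a
        for i, a in enumerate(plan.attributes)
    )
    return replace(plan, attributes=bars)

def changed_bars(old, new):
    """Indices of bars whose attributes differ between two plans."""
    return [i for i, (a, b) in enumerate(zip(old.attributes, new.attributes)) if a != b]

base = BarAttributes(energy=0.4, density=0.5, groove="straight", harmony="C", variation=0.0)
plan = Plan(
    beat_grid=BeatGrid(meter=(4, 4), tempo_bpm=120.0),
    program=(Segment("verse", 0, 0, 4), Segment("chorus", 1, 4, 4)),
    attributes=(base,) * 8,
)
edited = set_energy(plan, 4, 7, 0.9)   # raise energy across the chorus bars
print(changed_bars(plan, edited))      # [4, 5, 6, 7]
```

Keeping the plan immutable and diffing it bar by bar is one simple way to turn a user edit into a set of bars to regenerate while leaving all other bars alone.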