MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction

Yunkee Chae, Kyogu Lee

TL;DR

MGE-LDM introduces a unified latent-diffusion model that jointly handles music generation, source imputation, and language-driven source extraction by learning a joint latent space over mixture, submixture, and source embeddings. By formulating both synthesis and extraction as conditional inpainting in the latent domain and employing track-specific adaptive timesteps, the approach remains class-agnostic and dataset-agnostic, enabling training across heterogeneous multi-track datasets without predefined instrument vocabularies. Empirically, MGE-LDM achieves competitive generation quality and robust source extraction across Slakh2100, MUSDB18, and MoisesDB, especially when trained on combined data, and supports zero-shot, language-guided manipulation of arbitrary stems. The work highlights practical implications for flexible remixing, stem-level editing, and cross-domain robustness in multi-track music modeling, while outlining future directions for higher fidelity, stereo rendering, and improved additivity regularization in latent space.

Abstract

We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories. Audio samples are available at our project page: https://yoongi43.github.io/MGELDM_Samples/.
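
The inpainting view of the three inference tasks can be illustrated with a small sketch. It is not the authors' code: the task names, the `TASK_CONTEXT` table, and the `per_track_timesteps` helper are hypothetical. It also assumes that tracks supplied as context are held at timestep 0 (clean) while the remaining tracks follow the sampler's current timestep, which is one plausible reading of "track-specific adaptive timesteps".

```python
import numpy as np

# Track order in the joint latent (assumed for illustration).
TRACKS = ("mixture", "submixture", "source")

# Tracks given as clean context per task, read off the abstract and the
# figure captions. This mapping is an assumption, not the paper's API.
TASK_CONTEXT = {
    "total_generation": (),               # nothing given, all tracks sampled
    "source_imputation": ("submixture",), # submix given, source is generated
    "source_extraction": ("mixture",),    # mixture given, source is inpainted
}


def per_track_timesteps(task, t):
    """Return one timestep per track: 0 for context tracks, t otherwise."""
    known = TASK_CONTEXT[task]
    return np.array([0.0 if name in known else t for name in TRACKS])


if __name__ == "__main__":
    for task in TASK_CONTEXT:
        print(task, per_track_timesteps(task, 0.7))
```

Under this reading, one denoiser serves all three tasks, and only the per-track timestep vector and the text prompt change between them.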

Paper Structure

This paper contains 31 sections, 40 equations, 6 figures, 11 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overview of MGE-LDM. (a) Training pipeline: We train a three-track latent diffusion model on mixtures, submixtures, and sources. Each track is perturbed independently and conditioned on its corresponding timestep and CLAP embedding. The model is optimized using the v-objective (see the paper's subsections on the three-track LDM objective and on adaptive timesteps). (b) Inference pipeline: At test time, task-specific latents are either generated or inpainted based on available context and text prompts. The resulting latents are decoded into waveforms. See the paper's inference subsection for details.
  • Figure 2: Total generation examples. Each sample displays Mel-spectrograms of the mixture, submixture, and source tracks, all generated simultaneously by MGE-LDM. The mixture track is used to evaluate the total generation output.
  • Figure 3: Source imputation examples. Each row illustrates source inpainting results by MGE-LDM, conditioned on the text prompt "The sound of the {label}". The middle column shows the provided context mixture (submix), the rightmost column is the generated source, and the leftmost column is the recombined mixture of the submix and generated source. While some stems are imputed accurately, others fail due to data imbalance during training.
  • Figure 4: Source extraction examples. Source extraction results produced by MGE-LDM, conditioned on the text query "The sound of the {label}". The leftmost column shows the input mixture, the middle column is the extracted source predicted by the model, and the rightmost column is the ground-truth source. We observe that extraction quality may degrade for underrepresented classes such as strings, and in some cases, the model hallucinates unrelated instruments or incorrect timbres.
  • Figure : Inpainting using the RePaint approach.
  • ...and 1 more figure
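
The training step described in the Figure 1 caption, where each track is perturbed independently and the model is trained with the v-objective, might look roughly like the sketch below. It rests on assumptions: the cosine schedule (alpha = cos(πt/2), sigma = sin(πt/2)) is a common choice for v-prediction but is not confirmed by this summary, the tensor shapes are illustrative, and the denoising network and CLAP conditioning are omitted. `perturb_tracks` and `v_loss` are hypothetical names.

```python
import numpy as np


def perturb_tracks(latents, t, rng):
    """Noise each track at its own timestep and build the v-target.

    latents: array of shape (3, C, L) for mixture, submixture, source.
    t:       array of shape (3,), one timestep in [0, 1] per track.
    """
    # Assumed cosine schedule; alpha**2 + sigma**2 == 1 for every t.
    alpha = np.cos(0.5 * np.pi * t)[:, None, None]
    sigma = np.sin(0.5 * np.pi * t)[:, None, None]
    eps = rng.standard_normal(latents.shape)
    noisy = alpha * latents + sigma * eps
    # Standard v-parameterization target: v = alpha * eps - sigma * x.
    v_target = alpha * eps - sigma * latents
    return noisy, v_target


def v_loss(v_pred, v_target):
    """Mean squared error between predicted and target v."""
    return float(np.mean((v_pred - v_target) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 4, 8))   # toy latents, 3 tracks
    t = rng.uniform(0.0, 1.0, size=3)    # independent timestep per track
    noisy, v_target = perturb_tracks(x, t, rng)
    v_pred = np.zeros_like(v_target)     # stand-in for the network output
    print(v_loss(v_pred, v_target))
```

Sampling the three timesteps independently during training is what would let some tracks be held clean at inference while others are denoised, as in the inpainting formulation.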