MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction
Yunkee Chae, Kyogu Lee
TL;DR
MGE-LDM is a unified latent diffusion model that jointly handles music generation, source imputation, and language-driven source extraction by learning a joint latent space over mixture, submixture, and source embeddings. Both synthesis and extraction are formulated as conditional inpainting in the latent domain with track-specific adaptive timesteps, so the approach is class-agnostic and dataset-agnostic and can be trained across heterogeneous multi-track datasets without a predefined instrument vocabulary. Empirically, MGE-LDM achieves competitive generation quality and robust source extraction on Slakh2100, MUSDB18, and MoisesDB, especially when trained on the combined data, and it supports zero-shot, language-guided manipulation of arbitrary stems. These capabilities are relevant to flexible remixing, stem-level editing, and cross-domain robustness in multi-track music modeling; the stated future directions are higher fidelity, stereo rendering, and improved additivity regularization in the latent space.
Abstract
We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories. Audio samples are available at our project page: https://yoongi43.github.io/MGELDM_Samples/.
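The inpainting view of the three inference modes can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the choice of which track is observed in each task, the function names `per_track_timesteps` and `noise_tracks`, the latent shape, and the linear interpolation noise schedule are all assumptions made for the example. It only shows the idea that observed tracks are kept clean (timestep 0) while the remaining tracks are noised and would then be generated by the diffusion model.

```python
import numpy as np

# The joint latent stacks three tracks, in this (assumed) order.
TRACKS = ("mixture", "submixture", "source")

# Assumed mapping from task to the tracks given as conditioning.
TASK_OBSERVED = {
    "total_generation": (),                # nothing observed, generate all tracks
    "partial_generation": ("submixture",), # existing stems given, impute a new source
    "source_extraction": ("mixture",),     # mixture given, extract a queried source
}


def per_track_timesteps(task: str, t: float) -> np.ndarray:
    """Return one timestep per track: 0 for observed tracks, t for the rest."""
    observed = TASK_OBSERVED[task]
    return np.array([0.0 if name in observed else t for name in TRACKS])


def noise_tracks(latents: np.ndarray, timesteps: np.ndarray,
                 rng: np.random.Generator) -> np.ndarray:
    """Noise each track by its own timestep.

    latents has shape (3, channels, length). The schedule below is a simple
    linear interpolation between data and Gaussian noise, chosen only for
    illustration (hypothetical, not taken from the paper).
    """
    t = timesteps[:, None, None]  # broadcast one timestep over each track
    noise = rng.standard_normal(latents.shape)
    return (1.0 - t) * latents + t * noise


rng = np.random.default_rng(0)
latents = rng.standard_normal((3, 4, 16))  # toy (track, channel, length) latent

ts = per_track_timesteps("source_extraction", t=0.8)
noised = noise_tracks(latents, ts, rng)
# The observed mixture track stays clean; the other two tracks are noised.
```

Under these assumptions, switching tasks only changes which tracks receive timestep 0, which is why a single model can serve generation, imputation, and extraction.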
