Simultaneous Music Separation and Generation Using Multi-Track Latent Diffusion Models

Tornike Karchkhadze, Mohammad Rasool Izadi, Shlomo Dubnov

TL;DR

MSG-LD is a unified latent-diffusion approach that learns the joint latent distribution of multiple music tracks to perform source separation, total multi-track generation, and arrangement generation within a single framework. It extends MusicLDM with a 3D multi-track latent space and uses classifier-free guidance (CFG) to control how strongly the model follows its conditioning, which lets the same model switch between separation and generation. Evaluated on Slakh2100, MSG-LD significantly improves separation metrics and generation quality (FAD), including on arrangement tasks, compared to the MSDM baseline. Limitations include reduced audio quality from the 16 kHz sampling rate and reliance on pretrained VAE and vocoder components; future work targets higher sampling rates and soft conditioning to broaden practical use.

Abstract

Diffusion models have recently shown strong potential in both music generation and music source separation tasks. A trend, still in its early stages, is emerging towards integrating these tasks into a single framework, as both involve generating musically aligned parts and can be seen as facets of the same generative process. In this work, we introduce a latent diffusion-based multi-track generation model capable of both source separation and multi-track music synthesis by learning the joint probability distribution of tracks sharing a musical context. Our model also enables arrangement generation by creating any subset of tracks given the others. We trained our model on the Slakh2100 dataset, compared it with an existing simultaneous generation and separation model, and observed significant improvements across objective metrics for source separation, music generation, and arrangement generation tasks. Sound examples are available at https://msg-ld.github.io/.

Paper Structure

This paper contains 13 sections, 1 equation, 1 figure, 3 tables.

Figures (1)

  • Figure 1: MSG-LD system overview. During training, audio tracks are converted into Mel-spectrograms and compressed by a VAE encoder into a 3D latent space, where the LDM operates. The audio mixture is processed in the same way and used as a condition by adding it to each U-Net layer. During inference, the strength of the conditioning is controlled by the CFG weight, which switches the model between source separation and music generation modes. The generated latent vectors are up-sampled to Mel-spectrograms by the VAE decoder and converted into audio by a HiFi-GAN vocoder.
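
The role of the CFG weight in the caption above can be illustrated with the standard classifier-free guidance blend of a conditional and an unconditional noise prediction. This is a minimal sketch, not the authors' implementation: the function name `cfg_combine`, the latent shape, and the weight values are hypothetical, random arrays stand in for the U-Net outputs, and the paper's exact formulation and chosen weights are not given in this summary. Under this standard form, a weight of 0 ignores the mixture (generation), while a weight of 1 or more follows it (separation).

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance blend of two noise predictions.

    w = 0 returns the unconditional prediction, w = 1 returns the
    conditional one, and w > 1 extrapolates past the conditional one.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)


# Hypothetical multi-track latent shape: (tracks, channels, time, freq).
# The real dimensions depend on the VAE and are not stated in this summary.
shape = (4, 8, 16, 16)
rng = np.random.default_rng(0)

# Stand-ins for U-Net outputs without and with the mixture condition.
eps_uncond = rng.standard_normal(shape)
eps_cond = rng.standard_normal(shape)

# Assumed mapping: w = 0 ignores the mixture (generation mode),
# w >= 1 follows the mixture (separation mode).
gen = cfg_combine(eps_uncond, eps_cond, w=0.0)
sep = cfg_combine(eps_uncond, eps_cond, w=1.0)
```

In a full sampler this blend would be applied at every denoising step before the latent update, so changing `w` alone is enough to move a single trained model between the two modes.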