Multi-Source Diffusion Models for Simultaneous Music Generation and Separation

Giorgio Mariani, Irene Tallini, Emilian Postolache, Michele Mancusi, Luca Cosmo, Emanuele Rodolà

TL;DR

This work defines a diffusion-based generative model capable of both music synthesis and source separation by learning the score of the joint probability density of sources sharing a context. The authors present it as the first single model that can handle both generation and separation tasks.

Abstract

In this work, we define a diffusion-based generative model capable of both music synthesis and source separation by learning the score of the joint probability density of sources sharing a context. Alongside the classic total inference tasks (i.e., generating a mixture, separating the sources), we also introduce and experiment on the partial generation task of source imputation, where we generate a subset of the sources given the others (e.g., play a piano track that goes well with the drums). Additionally, we introduce a novel inference method for the separation task based on Dirac likelihood functions. We train our model on Slakh2100, a standard dataset for musical source separation, provide qualitative results in the generation settings, and showcase competitive quantitative results in the source separation setting. Our method is the first example of a single model that can handle both generation and separation tasks, thus representing a step toward general audio models.
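
One way to read the Dirac-likelihood idea mentioned in the abstract is as a hard constraint: if the mixture is assumed to be the plain sum of the sources, the last source is fully determined by the mixture and the others, and the chain rule gives the conditional score of each free source as its joint score minus the score of the constrained source. The sketch below illustrates this under loud assumptions that are not taken from the paper's text: a toy Gaussian prior with an analytic score stands in for the trained network, the noise schedule is variance-exploding, and the sampler is a plain Euler integration. The names `toy_joint_score` and `separate_dirac` are hypothetical.

```python
import numpy as np

def toy_joint_score(x, sigma, cov):
    """Score of N(0, cov + sigma^2 I), applied independently to each time sample.

    x has shape (n_sources, length). This analytic toy prior is an assumption;
    it stands in for a trained score network.
    """
    return -np.linalg.solve(cov + sigma**2 * np.eye(cov.shape[0]), x)

def separate_dirac(y, n_sources, score_fn, sigmas, rng):
    """Sample n_sources stems that sum exactly to the mixture y (shape (length,))."""
    # Only the first n_sources - 1 stems are free variables.
    free = sigmas[0] * rng.standard_normal((n_sources - 1, y.shape[0]))
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Hard constraint: the last stem is the mixture minus the free stems.
        last = y - free.sum(axis=0)
        s = score_fn(np.vstack([free, last]), sigma)
        # Chain rule through the constraint: score of stem m minus score of the last stem.
        cond_score = s[:-1] - s[-1]
        # Euler step of dx/dsigma = -sigma * score.
        free = free + (sigma_next - sigma) * (-sigma * cond_score)
    return np.vstack([free, y - free.sum(axis=0)])

# Toy demo: four positively correlated stems and their sum as the mixture.
rng = np.random.default_rng(0)
cov = 0.1 * np.eye(4) + 0.9 * np.ones((4, 4))
stems = np.linalg.cholesky(cov) @ rng.standard_normal((4, 1000))
mixture = stems.sum(axis=0)
sigmas = np.append(np.geomspace(10.0, 0.01, 50), 0.0)
estimate = separate_dirac(mixture, 4, lambda x, s: toy_joint_score(x, s, cov), sigmas, rng)
```

By construction the estimated stems always add up to the mixture, which is the practical appeal of a hard constraint over a soft Gaussian likelihood. With this exchangeable toy prior the individual estimates are not informative; a learned prior is what would make them musically meaningful.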

Paper Structure (25 sections, 40 equations, 3 figures, 6 tables, 1 algorithm)

Figures (3)

  • Figure 1: Proposed method. We leverage a forward Gaussian process (right-to-left) to learn the score over contextual sets (the large boxes) of instrumental sources (the waveforms) across different time steps $t$. During inference, the process is reversed (left-to-right), letting us perform the tasks of total generation, partial generation, or source separation (Figure 2).
  • Figure 2: Inference tasks with MSDM. Oblique lines represent the presence of noise in the signal, decreasing from left to right, with the highest noise level at time $T$ when we start the sampling procedure. Top-left: We generate all stems in a mixture, obtaining a total generation. Bottom-left: We perform partial generation (source imputation) by fixing the sources $\mathbf{x}_1$ (Bass) and $\mathbf{x}_3$ (Piano) and generating the other two sources $\hat{\mathbf{x}}_2(0)$ (Drums) and $\hat{\mathbf{x}}_4(0)$ (Guitar). We denote by $\mathbf{x}_1(t)$ and $\mathbf{x}_3(t)$ the noisy stems obtained from $\mathbf{x}_1$ and $\mathbf{x}_3$ via the perturbation kernel in Eq. \ref{eq:perturbation_kernel}. Right: We perform source separation by conditioning the prior with a mixture $\mathbf{y}$, following Algorithm 1.
  • Figure 3: Snippets from the subjective evaluation form. The first row concerns total generation: participants were asked to evaluate 30 songs, 15 generated by the mixture model and 15 by MSDM. Thirty-two people answered the survey. The second row concerns partial generation: subjects were asked to evaluate 15 songs. For each song, a random subset of sources is fixed and MSDM generates the others. The requested sources are explicitly stated above the song (e.g., in the snippet, the model has generated only the Bass stem). Twenty-one subjects answered.
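
The partial generation procedure described in the Figure 2 caption (fix some stems, noise them to the current level, and denoise only the remaining ones) can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it assumes a variance-exploding perturbation kernel $\mathbf{x}(t) = \mathbf{x}(0) + \sigma(t)\,\epsilon$, replaces the trained score network with a toy Gaussian prior whose score is analytic, and uses a plain Euler sampler. The names `toy_joint_score` and `impute_sources` are hypothetical.

```python
import numpy as np

def toy_joint_score(x, sigma, cov):
    """Score of N(0, cov + sigma^2 I), applied independently to each time sample.

    x has shape (n_sources, length). This analytic toy prior is an assumption;
    it stands in for a trained score network.
    """
    return -np.linalg.solve(cov + sigma**2 * np.eye(cov.shape[0]), x)

def impute_sources(x_known, known_mask, score_fn, sigmas, rng):
    """Generate the stems where known_mask is False, given the stems where it is True."""
    x = sigmas[0] * rng.standard_normal(x_known.shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Noise the fixed stems to the current level (assumed kernel: x + sigma * eps).
        noisy_known = x_known + sigma * rng.standard_normal(x_known.shape)
        x[known_mask] = noisy_known[known_mask]
        # Euler step of dx/dsigma = -sigma * score, using the joint score of all stems.
        x = x + (sigma_next - sigma) * (-sigma * score_fn(x, sigma))
    # At the end, the fixed stems are restored to their clean versions.
    x[known_mask] = x_known[known_mask]
    return x

# Toy demo in the caption's order: bass, drums, piano, guitar.
rng = np.random.default_rng(0)
cov = 0.1 * np.eye(4) + 0.9 * np.ones((4, 4))  # positively correlated toy stems
stems = np.linalg.cholesky(cov) @ rng.standard_normal((4, 2000))
known = np.array([True, False, True, False])   # fix bass and piano
sigmas = np.append(np.geomspace(10.0, 0.01, 50), 0.0)
result = impute_sources(stems, known, lambda x, s: toy_joint_score(x, s, cov), sigmas, rng)
```

Because the joint score couples all stems, the generated drums and guitar are pulled toward values compatible with the fixed bass and piano. This replacement-style scheme only approximates true conditional sampling, so it should be read as an illustration of the mechanism rather than an exact sampler.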