Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures
Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira
TL;DR
The paper addresses singing voice separation from real mixtures by conditioning a diffusion model on the input mixture, which enables waveform-domain vocal generation without assuming that the sources sum exactly to the mixture. It introduces a generation network and a conditioner, augmented with multi-resolution conditioning, self-attention, and optional transformer components, plus auxiliary losses that align the latent representations with the target vocal. Using a stochastic DDIM-based sampling process with frequency-selective refinement, the approach outperforms prior generative baselines on signal-to-distortion ratio (SDR) and signal-to-interference ratio (SIR), and approaches non-generative methods when trained with extra data. The number of sampling steps gives the user a tunable quality-speed trade-off. The result is a lighter, more flexible diffusion-based model for music source separation (MSS), with practical sampling controls and open-source resources for further research.
Abstract
Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a mixture to extract the target sources, the flexibility and generalization capabilities of generative diffusion models are giving rise to a novel class of solutions for this complicated task. In this work, we explore singing voice separation from real music recordings using a diffusion model which is trained to generate the solo vocals conditioned on the corresponding mixture. Our approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data. The iterative nature of diffusion sampling enables the user to control the quality-efficiency trade-off, and also refine the output when needed. We present an ablation study of the sampling algorithm, highlighting the effects of the user-configurable parameters.
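To make the quality-efficiency trade-off concrete, the following is a minimal sketch of a generic DDIM sampling loop conditioned on a mixture. It is not the sampling algorithm of the paper: the linear noise schedule, the function names (`make_alpha_bar`, `toy_eps_model`, `ddim_sample`), and the parameters `num_steps` and `eta` are assumptions made for illustration, the frequency-selective refinement is not included, and the trained network is replaced by a hypothetical stand-in that simply treats half of the mixture as the vocal.

```python
import numpy as np


def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level for a linear beta schedule (assumed, not from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)


def toy_eps_model(x_t, mixture, alpha_bar_t):
    """Hypothetical stand-in for the trained, mixture-conditioned network.

    It pretends the clean vocal is half of the mixture and returns the noise
    that would explain x_t under that guess. A real model is learned.
    """
    vocal_guess = 0.5 * mixture
    return (x_t - np.sqrt(alpha_bar_t) * vocal_guess) / np.sqrt(1.0 - alpha_bar_t)


def ddim_sample(mixture, num_steps=20, eta=0.0, eps_model=toy_eps_model, seed=0):
    """Generate a vocal estimate from noise, conditioned on the mixture.

    num_steps sets the number of network evaluations (speed vs. quality);
    eta=0 gives deterministic DDIM, eta>0 injects noise at each step.
    """
    rng = np.random.default_rng(seed)
    alpha_bar = make_alpha_bar()
    # Evenly spaced subset of the training timesteps, from noisiest to cleanest.
    timesteps = np.linspace(len(alpha_bar) - 1, 0, num_steps).round().astype(int)

    x = rng.standard_normal(mixture.shape)  # start from Gaussian noise
    for i, t in enumerate(timesteps):
        ab_t = alpha_bar[t]
        # After the last step the signal level is 1 (no noise left).
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0

        eps = eps_model(x, mixture, ab_t)
        # Clean-signal estimate implied by the predicted noise.
        x0_hat = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        # Per-step noise scale of the DDIM update.
        sigma = eta * np.sqrt((1.0 - ab_prev) / (1.0 - ab_t)) * np.sqrt(1.0 - ab_t / ab_prev)
        direction = np.sqrt(np.maximum(1.0 - ab_prev - sigma**2, 0.0)) * eps
        x = np.sqrt(ab_prev) * x0_hat + direction + sigma * rng.standard_normal(x.shape)
    return x
```

In this sketch, calling `ddim_sample(mixture, num_steps=5)` costs five network evaluations and `num_steps=50` costs fifty. With a real trained model, more steps would typically trade speed for output quality, while the toy stand-in recovers its fixed guess regardless of the step count.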
