ZeroSep: Separate Anything in Audio with Zero Training

Chao Huang, Yuesheng Ma, Junxuan Huang, Susan Liang, Yunlong Tang, Jing Bi, Wenqiang Liu, Nima Mesgarani, Chenliang Xu

TL;DR

This paper tackles open-set audio source separation without labeled training data. It introduces ZeroSep, a zero-training framework that repurposes pre-trained text-guided audio diffusion models: it first inverts the mixed signal into the model's latent space and then applies text-conditioned denoising to recover individual sources. The approach achieves state-of-the-art or competitive performance on the AVE and MUSIC benchmarks, often surpassing supervised methods, and handles diverse mixture types thanks to the model's rich textual priors. Key contributions include a training-free separation paradigm, applicability across different diffusion backbones, and open-set capability supported by qualitative results and perceptual metrics. The work highlights the potential of diffusion priors for discriminative tasks and broadens the toolkit for practical, data-efficient audio separation.

Abstract

Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model's latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.
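The two stages described in the abstract (inversion, then text-conditioned denoising) can be illustrated with a minimal sketch. It assumes deterministic DDIM updates and the standard classifier-free guidance form with weight $\omega$; the paper's exact formulation may differ. The names `embed`, `toy_eps`, `guided_eps`, `invert`, and `separate` are hypothetical, and the noise predictor is a toy stand-in that ignores its input, not a real audio diffusion model.

```python
import numpy as np

def embed(prompt, dim):
    """Hypothetical stand-in for a text encoder: one fixed vector per prompt."""
    seed = sum(ord(ch) for ch in prompt)
    return np.random.default_rng(seed).standard_normal(dim)

def toy_eps(x, t, prompt):
    """Toy noise predictor (assumption). It ignores x and t, so inversion is
    exactly reversible here; with a real network it is only approximate."""
    return embed(prompt, x.shape[-1])

def guided_eps(x, t, prompt, omega, null_prompt=""):
    """Classifier-free guidance: blend null-prompt and text-conditioned noise."""
    e_null = toy_eps(x, t, null_prompt)
    e_cond = toy_eps(x, t, prompt)
    return e_null + omega * (e_cond - e_null)

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update between two cumulative-alpha levels."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def invert(x0, alpha_bar, inv_prompt=""):
    """Stage (a): map the mixture latent to a noisy latent (t = 0 -> T)."""
    x = x0
    for t in range(len(alpha_bar) - 1):
        x = ddim_step(x, toy_eps(x, t, inv_prompt), alpha_bar[t], alpha_bar[t + 1])
    return x

def separate(x_T, alpha_bar, rev_prompt, omega):
    """Stage (b): denoise from T back to 0, guided by the target-source prompt."""
    x = x_T
    for t in range(len(alpha_bar) - 1, 0, -1):
        eps = guided_eps(x, t, rev_prompt, omega)
        x = ddim_step(x, eps, alpha_bar[t], alpha_bar[t - 1])
    return x

# Usage: an 8-dimensional random vector stands in for the mixture latent.
alpha_bar = np.linspace(1.0, 0.05, 21)            # toy noise schedule
mixture = np.random.default_rng(0).standard_normal(8)
x_T = invert(mixture, alpha_bar)                  # inversion with a null prompt
recon = separate(x_T, alpha_bar, "dog bark", omega=0.0)   # no guidance
target = separate(x_T, alpha_bar, "dog bark", omega=1.0)  # full text guidance
```

In this sketch, $\omega=0$ simply retraces the inversion path and returns the mixture latent, while $\omega=1$ steers the trajectory toward the text condition. That is one way to read the role of the guidance weight, under the assumptions stated above.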

Paper Structure

This paper contains 14 sections, 6 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: The overview of ZeroSep, which includes (a) an inversion process to obtain a latent representation for the mixture, and (b) a separation denoising process to effectively extract the target source with text conditions. We show the choice of inversion prompt $\mathbf{c_{\text{inv}}}$ and reverse prompt $\mathbf{c_{\text{rev}}}$ in (c), and demonstrate the valid separation region defined by $\omega$ in (d).
  • Figure 2: Qualitative visualization of audio separation results. The figure shows the input mixture (containing speech and dog barking) and the separated "dog barking" source produced by different baselines and ZeroSep. ZeroSep, guided by the text prompt "dog bark", successfully isolates the target sound, demonstrating its effectiveness compared to baseline methods. More separation results can be found in the supplementary materials.
  • Figure 3: (a) Impact of guidance weight $\omega$: increasing $\omega$ from 0 to 1 improves separation metrics (LPAPS and CLAP-A), whereas $\omega>1$ degrades performance below the mixture baseline ($\omega=0$), underscoring the critical role of $\omega$. (b)–(c) Positive correlation between separation quality (all scores from Table \ref{tab:inversion_model}, normalized) and generative capability (normalized FAD scores on AudioCaps \cite{liu2023audioldm, audioldm2-2024taslp}) across AudioLDM variants, indicating that stronger generation can potentially lead to better separation.
  • Figure 4: Failure case analysis of ZeroSep. Mixture: Man speech (stem 1) + Shofar (stem 2).
  • Figure 5: Mixture: Cello (stem 1) + Erhu (stem 2).
  • ...and 11 more figures