Training-Free Multi-Step Audio Source Separation

Yongyi Zang, Jingyi Li, Qiuqiang Kong

TL;DR

This work addresses the limitations of one-step audio source separation by introducing a training-free, multi-step inference strategy that iteratively refines outputs by remixing the original mixture with the previous estimate and selecting the best candidate with a quality metric. At step $t$, the method forms $x_t = r_t x_0 + (1 - r_t) y_{t-1}$, where $x_0$ is the input mixture, $y_{t-1}$ is the previous estimate, $f$ is the pretrained separator, and $R$ is the quality metric. It searches over $K$ candidate ratios to maximize $R(f(x_t^{(k)}))$ and sets $y_t = f(x_t^*)$. The analysis provides theoretical guarantees that the metric is non-decreasing, along with explicit error bounds that depend on Lipschitz constants and metric noise. The work also connects this practical inference technique to denoising diffusion bridge models, arguing that data-mixing augmentation during training imprints a denoising capability along a linear interpolation between noise and clean signals. Empirically, the approach yields consistent gains over one-step inference in both speech enhancement and music source separation, approaching the benefits of larger models or multi-step training while incurring modest computation. The work provides open-source code and offers a principled bridge-model perspective that could inform the future design of training-efficient, inference-time scalable audio processing systems.
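The update rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released implementation: the function name `multi_step_separate`, the uniform ratio grid, the moving-average `toy_separator`, and the reference-based `toy_metric` are all hypothetical stand-ins for a pretrained separator $f$ and a quality metric $R$.

```python
import numpy as np


def multi_step_separate(f, R, x0, num_steps=3, ratios=None):
    """Training-free multi-step inference (illustrative sketch).

    f: separator mapping a signal to an estimate (assumed deterministic).
    R: metric mapping an estimate to a scalar score (higher is better).
    x0: input mixture. ratios: candidate blending ratios r in [0, 1].
    """
    if ratios is None:
        # Assumption: a uniform grid of K = 11 ratios that includes r = 1,
        # so the one-step estimate f(x0) is always among the candidates.
        ratios = np.linspace(0.0, 1.0, 11)

    y = f(x0)                      # one-step estimate y_0
    scores = [R(y)]
    for _ in range(num_steps):
        best_y, best_score = None, -np.inf
        for r in ratios:
            x_t = r * x0 + (1.0 - r) * y   # x_t = r x_0 + (1 - r) y_{t-1}
            candidate = f(x_t)
            score = R(candidate)
            if score > best_score:
                best_y, best_score = candidate, score
        y = best_y                 # y_t = f(x_t^*)
        scores.append(best_score)
    return y, scores


# Toy setup (hypothetical): a sine wave in Gaussian noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 512)
clean = np.sin(2.0 * np.pi * 5.0 * t)
x0 = clean + 0.3 * rng.standard_normal(t.shape)


def toy_separator(x):
    # Stand-in for a pretrained model: a 5-tap moving average.
    return np.convolve(x, np.ones(5) / 5.0, mode="same")


def toy_metric(y):
    # Stand-in for a quality metric: negative MSE against the clean signal.
    return -float(np.mean((y - clean) ** 2))


y_final, scores = multi_step_separate(toy_separator, toy_metric, x0)
```

Each step costs $K$ forward passes plus $K$ metric evaluations, so the ratio grid size directly trades search quality against compute. In practice the metric would have to be one that can be evaluated at inference time; the reference-based toy metric here is only for demonstration.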

Abstract

Audio source separation aims to separate a mixture into target sources. Previous audio source separation systems usually conduct one-step inference, which does not fully exploit the separation ability of models. In this work, we reveal that pretrained one-step audio source separation models can be leveraged for multi-step separation without additional training. We propose a simple yet effective inference method that iteratively applies separation by optimally blending the input mixture with the previous step's separation result. At each step, we determine the optimal blending ratio by maximizing a metric. We prove that our method always yields improvement over one-step inference, derive error bounds based on model smoothness and metric robustness, and provide a theoretical analysis connecting our method to denoising along linear interpolation paths between noise and clean distributions, a property we link to denoising diffusion bridge models. Our approach effectively delivers improved separation performance as a "free lunch" from existing models. Our empirical results demonstrate that our multi-step separation approach consistently outperforms one-step inference across both speech enhancement and music source separation tasks, and can achieve scaling performance similar to training a larger model, using more data, or in some cases employing a multi-step training objective. These improvements appear not only on the metric optimized during multi-step inference, but also extend to nearly all non-optimized metrics (with one exception). We also discuss limitations of our approach and directions for future research.

Paper Structure

This paper contains 18 sections, 3 theorems, 31 equations, 12 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

There always exists an optimal $x_{t}^{*}$ such that $R(y_{t}) \geq R(y_{0})$.
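One way to see why a bound of this form can hold is sketched below. It is an informal argument and not necessarily the paper's own proof. It assumes that the one-step estimate is $y_0 = f(x_0)$, that $f$ is deterministic, and that the searched ratios include $r_t = 1$.

```latex
% Choosing r_t = 1 recovers the original mixture:
x_t\big|_{r_t = 1} = 1 \cdot x_0 + (1 - 1)\, y_{t-1} = x_0
\quad\Longrightarrow\quad
f\!\left(x_t\big|_{r_t = 1}\right) = f(x_0) = y_0 .

% The maximum over all candidates is at least the value at r_t = 1:
R(y_t) = \max_{k} R\!\left(f\!\left(x_t^{(k)}\right)\right)
\;\geq\; R\!\left(f(x_0)\right) = R(y_0) .
```

Under these assumptions the one-step result is always among the candidates, so selecting the best-scoring candidate cannot score below it.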

Figures (12)

  • Figure 1: PESQ across all inference steps for VCTK-DEMAND test set.
  • Figure 2: STOI across all inference steps for VCTK-DEMAND test set.
  • Figure 3: SI-SNR across all inference steps for VCTK-DEMAND test set.
  • Figure 4: UTMOS across all inference steps for DNS Challenge V3 test set.
  • Figure 5: DNSMOS SIG across all inference steps for DNS Challenge V3 test set.
  • ...and 7 more figures

Theorems & Definitions (6)

  • Theorem 1: Lower Bound of Metrics
  • Proof 1
  • Lemma 1: Lipschitz Bound on Derivative
  • Theorem 2: Error Bound
  • Proof 2
  • Proof 3