TISDiSS: A Training-Time and Inference-Time Scalable Framework for Discriminative Source Separation

Yongsheng Feng, Yuetonghui Xu, Jiehui Luo, Hongjia Liu, Xiaobing Li, Feng Yu, Wei Li

TL;DR

TISDiSS addresses the tension between separation quality and compute cost by unifying training-time and inference-time scalability for discriminative source separation. It introduces a TF-domain five-component architecture with early-split multi-loss supervision, shared-parameter design, and dynamic inference repetitions, enabling flexible speed–quality trade-offs with a single trained model. Empirical results on WSJ0-2mix, Libri2Mix, and WHAMR! show state-of-the-art performance with fewer parameters and robust gains across noisy and reverberant data; training with more inference repetitions also improves shallow-inference performance, validating the proposed training-time scalability. The framework supports practical deployment through various lightweight configurations and fine-tuning strategies, making adaptive audio separation feasible for diverse applications.

Abstract

Source separation is a fundamental task in speech, music, and audio processing, and it also provides cleaner and larger data for training generative models. However, improving separation performance in practice often depends on increasingly large networks, inflating training and deployment costs. Motivated by recent advances in inference-time scaling for generative modeling, we propose Training-Time and Inference-Time Scalable Discriminative Source Separation (TISDiSS), a unified framework that integrates early-split multi-loss supervision, shared-parameter design, and dynamic inference repetitions. TISDiSS enables flexible speed-performance trade-offs by adjusting inference depth without retraining additional models. We further provide systematic analyses of architectural and training choices and show that training with more inference repetitions improves shallow-inference performance, benefiting low-latency applications. Experiments on standard speech separation benchmarks demonstrate state-of-the-art performance with a reduced parameter count, establishing TISDiSS as a scalable and practical framework for adaptive source separation. Code is available at https://github.com/WingSingFung/TISDiSS.
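
To make the combination of shared parameters, dynamic inference repetitions, and multi-loss supervision more concrete, the following is a minimal toy sketch in plain Python. It is not the TISDiSS implementation: `SharedBlock`, `run_repetitions`, and `multi_loss` are hypothetical names, the "block" is a single affine map standing in for a real separation block, and attaching an equally weighted loss to every repetition's output is an assumption about how multi-loss supervision might look. The exact early-split placement used by the authors is not described in this section.

```python
from typing import List

Vector = List[float]


class SharedBlock:
    """Toy stand-in for a separation block (hypothetical, not the paper's block).

    The same weight and bias are reused on every repetition, so running the
    block more times adds compute but no parameters.
    """

    def __init__(self, weight: float, bias: float) -> None:
        self.weight = weight
        self.bias = bias

    def __call__(self, x: Vector) -> Vector:
        return [self.weight * v + self.bias for v in x]


def run_repetitions(block: SharedBlock, features: Vector, n_re: int) -> List[Vector]:
    """Apply the shared block n_re times and keep every intermediate output."""
    outputs: List[Vector] = []
    hidden = list(features)
    for _ in range(n_re):
        hidden = block(hidden)
        outputs.append(hidden)
    return outputs


def mse(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_loss(outputs: List[Vector], target: Vector) -> float:
    """Average a loss over the output of every repetition (assumed weighting)."""
    return sum(mse(o, target) for o in outputs) / len(outputs)


if __name__ == "__main__":
    block = SharedBlock(weight=0.5, bias=1.0)
    features, target = [0.0, 4.0], [2.0, 2.0]

    # Training-time view: supervise all three repetitions at once.
    print("multi-loss:", multi_loss(run_repetitions(block, features, 3), target))

    # Inference-time view: the caller picks the depth, with no retraining.
    for n_re in (1, 2, 3):
        final = run_repetitions(block, features, n_re)[-1]
        print(n_re, final, mse(final, target))
```

In this toy setting each extra repetition moves the output closer to the target, which mirrors the speed-performance trade-off described above: a shallow run is cheaper and a deeper run is more accurate, using the same parameters in both cases.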

Paper Structure

This paper contains 12 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overall architecture
  • Figure 2: Separation block
  • Figure 3: Reconstruction block (P&R: Permute&Reshape)
  • Figure 5: Ablation study with SI-SNRi [dB] on the $y$-axis and $N_{\text{re}}$ used during inference on the $x$-axis (WSJ0-2mix except (f)(g)).
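
For readers unfamiliar with the metric on the $y$-axis of the ablation figure, the sketch below computes scale-invariant signal-to-noise ratio (SI-SNR) and its improvement over the unprocessed mixture (SI-SNRi) using the standard definition. The function names, the `eps` stabilizer, and the toy signals are illustrative assumptions and are not taken from the TISDiSS code base.

```python
import math
from typing import List

Signal = List[float]


def _zero_mean(x: Signal) -> Signal:
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def si_snr(estimate: Signal, reference: Signal, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    # Project the estimate onto the reference to get the "target" component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Everything not explained by the reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate: Signal, mixture: Signal, reference: Signal) -> float:
    """SI-SNRi: gain of the estimate over the raw mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


if __name__ == "__main__":
    reference = [1.0, -1.0, 1.0, -1.0]
    interferer = [1.0, 1.0, -1.0, -1.0]
    mixture = [r + i for r, i in zip(reference, interferer)]
    estimate = [0.9, -1.1, 1.2, -0.8]  # made-up separator output
    print("SI-SNRi [dB]:", si_snr_improvement(estimate, mixture, reference))
```

Because the estimate is projected onto the reference before the energies are compared, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.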