Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation
Ke Chen, Jiaqi Su, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Zeyu Jin
TL;DR
The paper tackles robust single-channel speech separation across varied real-world acoustics by introducing Acoustic-Content Simulation (AC-SIM), a data-generation pipeline that jointly varies content and acoustic properties. It augments Permutation Invariant Training (PIT) with perceptual losses (multi-resolution STFT magnitude, mel-spectrogram, and time-domain L2), forming a unified objective that improves both separation quality and generalization. Experiments with ConvTasNet, DPRNN, and SepFormer backbones on benchmarks such as wsj0-2mix, WHAM!, WHAMR!, and Libri2Mix show that AC-SIM, particularly when trained with multiple objectives (AC-SIM-ML), generalizes better to non-homologous and real-world data. Subjective MOS tests corroborate the perceptual gains and reveal nuances in how objective SDR improvements relate to human judgments, underscoring the value of integrating perceptual losses. The findings suggest that data-driven diversification of content and acoustics, combined with enriched loss functions, can produce separators that are more robust across diverse real-world scenarios.
Abstract
Achieving robust speech separation for overlapping speakers in varied acoustic environments with noise and reverberation remains an open challenge. Although existing datasets can train separators for specific scenarios, the resulting models do not generalize effectively across diverse real-world scenarios. In this paper, we present a novel data simulation pipeline that produces diverse training data from a range of acoustic environments and content, and we propose new training paradigms to improve the quality of a general speech separation model. Specifically, we first introduce AC-SIM, a data simulation pipeline that incorporates broad variations in both content and acoustics. We then integrate multiple training objectives into permutation invariant training (PIT) to enhance the separation quality and generalization of the trained model. Finally, we conduct comprehensive objective and human listening experiments across separation architectures and benchmarks to validate our methods, demonstrating substantial improvements in generalization on both non-homologous and real-world test sets.
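
To make the idea of combining several objectives inside PIT concrete, the following is a minimal NumPy sketch, not the authors' implementation. It scores every assignment of estimated sources to reference sources with a weighted sum of a time-domain L2 term and a multi-resolution STFT magnitude term, then keeps the lowest-loss permutation. The function names (`stft_mag`, `pair_loss`, `pit_loss`), the STFT resolutions, the L1 distance on magnitudes, and the unit weights are all assumptions chosen for illustration; the mel-spectrogram term is omitted for brevity.

```python
import itertools

import numpy as np


def stft_mag(x, n_fft, hop):
    """Magnitude STFT of a 1-D signal with a Hann window (no padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=-1))


def pair_loss(est, ref,
              resolutions=((256, 64), (512, 128), (1024, 256)),  # assumed values
              w_time=1.0, w_stft=1.0):                           # assumed weights
    """Loss between one estimate and one reference signal."""
    # Time-domain L2 term.
    time_l2 = np.mean((est - ref) ** 2)
    # Multi-resolution STFT magnitude term (L1 distance, averaged over resolutions).
    stft_l1 = np.mean([
        np.mean(np.abs(stft_mag(est, n, h) - stft_mag(ref, n, h)))
        for n, h in resolutions
    ])
    return w_time * time_l2 + w_stft * stft_l1


def pit_loss(estimates, references):
    """Return (loss, perm) where references[i] is matched to estimates[perm[i]]."""
    n_src = len(references)
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_src)):
        loss = np.mean(
            [pair_loss(estimates[p], references[i]) for i, p in enumerate(perm)]
        )
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Usage: two synthetic "sources" whose estimates arrive in swapped order.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(4000)
s2 = rng.standard_normal(4000)
loss, perm = pit_loss([s2, s1], [s1, s2])
```

Because the estimates in the usage example are exact copies of the references in swapped order, the search recovers the swapped assignment with zero loss. Exhaustive search over permutations is only practical for a small number of speakers, which matches the two-speaker benchmarks named above.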
