ConSep: a Noise- and Reverberation-Robust Speech Separation Framework by Magnitude Conditioning
Kuan-Hsun Ho, Jeih-weih Hung, Berlin Chen
TL;DR
ConSep targets speech separation that stays robust under noise and reverberation by conditioning time-domain features on magnitude spectrogram information. It combines a learnable encoder and an STFT-based path with MulCA, applies FiLM modulation driven by the magnitude, and uses a SepFormer-based mask estimator with a time-domain decoder. In anechoic, noisy, and reverberant conditions, ConSep outperforms the SepFormer and Bi-Sep baselines and remains stable across several conditioning ablations, which supports the value of magnitude conditioning. The gains are relevant to real-world speech separation, and qualitative visualizations show the advantage of focusing on informative frequency bands and harmonic structure.
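To make the idea of FiLM modulation by the magnitude more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the generic FiLM form, in which a per-frame scale and shift are predicted from the conditioning input and applied to the features. The function names, the linear projections, the frame sizes, and the random stand-in for the learnable encoder output are all illustrative assumptions, and the sketch assumes both paths have already been aligned to the same number of frames.

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=256, hop=128):
    """Return the STFT magnitude with shape (frames, n_fft // 2 + 1)."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Build a (frames, n_fft) index grid so each row is one windowed frame.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    return np.abs(np.fft.rfft(frames, axis=1))

def film(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """Generic FiLM: scale and shift features using the condition.

    features:  (frames, channels)  time-domain encoder output (assumed shape)
    condition: (frames, freq_bins) magnitude spectrogram
    w_*:       (freq_bins, channels) hypothetical linear projections
    b_*:       (channels,)
    """
    gamma = condition @ w_gamma + b_gamma  # per-frame, per-channel scale
    beta = condition @ w_beta + b_beta     # per-frame, per-channel shift
    return gamma * features + beta

rng = np.random.default_rng(0)
mixture = rng.standard_normal(8000)         # stand-in for a mixture waveform
magnitude = magnitude_spectrogram(mixture)  # shape (61, 129)

n_frames, n_bins = magnitude.shape
n_channels = 64
# Random stand-in for the learnable encoder output (assumption).
features = rng.standard_normal((n_frames, n_channels))

w_gamma = 0.01 * rng.standard_normal((n_bins, n_channels))
w_beta = 0.01 * rng.standard_normal((n_bins, n_channels))
b_gamma = np.ones(n_channels)   # start near an identity scale
b_beta = np.zeros(n_channels)   # start with no shift

modulated = film(features, magnitude, w_gamma, b_gamma, w_beta, b_beta)
```

With zero projection weights, a unit scale bias, and a zero shift bias, `film` returns the features unchanged, so in this sketch the conditioning acts as a perturbation around the identity.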
Abstract
Speech separation has recently made significant progress thanks to the fine-grained view of the waveform offered by time-domain methods. However, several studies have shown that adopting the Short-Time Fourier Transform (STFT) for feature extraction can be beneficial under harsher conditions, such as noise or reverberation. We therefore propose ConSep, a magnitude-conditioned time-domain framework that inherits these benefits. Experiments show that ConSep improves performance in anechoic, noisy, and reverberant settings compared with two well-known methods, SepFormer and Bi-Sep. Furthermore, we visualize the components of ConSep to illustrate these advantages and show that they are consistent with the observations from our preliminary studies.
