ConSep: a Noise- and Reverberation-Robust Speech Separation Framework by Magnitude Conditioning

Kuan-Hsun Ho, Jeih-weih Hung, Berlin Chen

TL;DR

ConSep addresses robust speech separation under noise and reverberation by conditioning time-domain features on magnitude spectrogram information. It combines a learnable encoder and an STFT-based path with MulCA, modulated by FiLM using the magnitude, and uses a SepFormer-based mask estimator with a time-domain decoder. Across anechoic, noisy, and reverberant conditions, ConSep outperforms strong baselines and exhibits stability under various conditioning ablations, confirming the value of magnitude conditioning. The approach offers practical gains for real-world speech separation and is supported by qualitative visualizations that illuminate the advantages of focusing on informative frequency bands and harmonic structure.
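To make the idea of conditioning on the magnitude concrete, here is a minimal, generic sketch of FiLM (feature-wise linear modulation), in which a conditioning vector yields a per-channel scale and shift applied to another feature vector. It is not the authors' implementation. The helper names (`magnitude_frames`, `linear`, `film`), the Hann window, the naive DFT, and the single dense layers that produce the scale and shift are all assumptions chosen for illustration; this summary does not specify how ConSep computes them.

```python
import cmath
import math


def magnitude_frames(signal, frame_len, hop):
    """Magnitude spectrum of each frame, using a Hann window and a naive DFT.

    Hypothetical stand-in for an STFT front end; a real system would use an FFT.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(n_bins):
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def linear(vec, weight, bias):
    """Dense layer; `weight` is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: out[c] = gamma[c] * features[c] + beta[c].

    gamma and beta are predicted from the conditioning vector `cond`
    (here, one magnitude frame) by two assumed dense layers.
    """
    gamma = linear(cond, w_gamma, b_gamma)
    beta = linear(cond, w_beta, b_beta)
    return [g * x + b for g, x, b in zip(gamma, features, beta)]
```

In use, one frame from `magnitude_frames` would serve as `cond`, and `features` would be the learnable encoder's output for the same frame. The weights would be learned in practice; here they are passed in explicitly so the arithmetic stays visible.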

Abstract

Speech separation has recently made significant progress thanks to the fine-grained resolution of time-domain methods. However, several studies have shown that adopting the Short-Time Fourier Transform (STFT) for feature extraction can be beneficial under harsher conditions, such as noise or reverberation. We therefore propose ConSep, a magnitude-conditioned time-domain framework, to inherit these beneficial characteristics. Experiments show that ConSep improves performance in anechoic, noisy, and reverberant settings compared with two well-known methods, SepFormer and Bi-Sep. Furthermore, we visualize the components of ConSep to substantiate these advantages and show that they are consistent with the findings of our preliminary studies.

Paper Structure (16 sections, 5 equations, 2 figures, 3 tables)

This paper contains 16 sections, 5 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Case studies. From top to bottom, the rows show the spectrograms of the mixture, the sources, the ConSep output, and the SepFormer output. From left to right, the two columns correspond to the first and second sources. Red boxes mark false-alarm issues, and blue boxes mark spectral/harmonic clarity. In (a) and (b), the non-speech signals enclosed in the orange box are an inhaling sound and a microphone pop, respectively.
  • Figure 2: In each sub-figure, the upper and lower panels depict the encoder bases sorted by Euclidean similarity and their frequency responses, respectively.