Asymmetric Encoder-Decoder Based on Time-Frequency Correlation for Speech Separation

Ui-Hyeop Shin, Hyung-Min Park

Abstract

Speech separation in realistic acoustic environments remains challenging because overlapping speakers, background noise, and reverberation must be resolved simultaneously. Although recent time-frequency (TF) domain models have shown strong performance, most still rely on late-split architectures, where speaker disentanglement is deferred to the final stage, creating an information bottleneck and weakening discriminability under adverse conditions. To address this issue, we propose SR-CorrNet, an asymmetric encoder-decoder framework that introduces the separation-reconstruction (SepRe) strategy into a TF dual-path backbone. The encoder performs coarse separation from mixture observations, while the weight-shared decoder progressively reconstructs speaker-discriminative features with cross-speaker interaction, enabling stage-wise refinement. To complement this architecture, we formulate speech separation as a structured correlation-to-filter problem: spatio-spectro-temporal correlations computed from the observations are used as input features, and the corresponding deep filters are estimated to recover target signals. We further incorporate an attractor-based dynamic split module to adapt the number of output streams to the actual speaker configuration. Experimental results on WSJ0-2/3/4/5Mix, WHAMR!, and LibriCSS demonstrate consistent improvements across anechoic, noisy-reverberant, and real-recorded conditions in both single- and multi-channel settings, highlighting the effectiveness of TF-domain SepRe with correlation-based filter estimation for speech separation.

Paper Structure

This paper contains 39 sections, 19 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Block diagrams of (a) Late-split and (b) Early-split schemes.
  • Figure 2: Illustration of two-stage multi-channel separation structure.
  • Figure 3: Architecture of SR-CorrNet. The multi-channel multi-frame observation $\tilde{\mathbf{x}}_{tf}\in \mathbb{R}^{2L+1}$ is used both to compute correlations and to perform output filtering. From $\tilde{\mathbf{x}}_{tf}$, the Correlation module computes a latent representation of correlations, $\mathbf{E}_{tf}^{(0)}\in\mathbb{R}^{C}$, as input to the TF-Encoder. A stack of $B_E$ TF-Encoder blocks then processes the features into $\mathbf{E}_{tf}^{(B_E)}$, which are split into speaker-specific features $\mathbf{D}_{k,tf}^{(0)}$ for the TF-Decoder. In the TF-Decoder, the separated features are reconstructed $B_D$ times, and the reconstructed features $\mathbf{D}_{k,tf}^{(B_D)}$ are used to predict the final multi-channel multi-tap filter $\mathbf{w}_{k,tf}$ for $Y_{k,tf}$.
  • Figure 4: Block diagrams of (a) the Correlation module and (b) the Filter module. In the Correlation module, the correlation operator performs the correlation operation of (\ref{eq:corr_MISO}) with normalization by (\ref{eq:PHAT}) or (\ref{eq:SCOT}). The complex correlations are then flattened into a real-valued vector in $\mathbb{R}^{2M(2L+1)(2I+1)}$.
  • Figure 5: Block diagrams of (a) Common unit module for Time and Frequency module and (b) speaker interaction module.
  • ...and 2 more figures
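The caption of Figure 4 specifies the shape of the correlation features: per-TF-bin complex correlations over $M$ channels, $2L+1$ frames, and $2I+1$ frequency taps, normalized (e.g. PHAT-style) and flattened into a real vector of dimension $2M(2L+1)(2I+1)$. A minimal NumPy sketch of that feature computation is shown below; the function name, the choice of channel 0 as reference, and the loop-based layout are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def phat_correlation_features(X, L=2, I=1, eps=1e-8):
    """Hypothetical sketch of per-bin correlation features (Fig. 4 style).

    X : complex STFT observations of shape (M, T, F).
    For each TF bin, correlate a (M, 2L+1, 2I+1) spatio-spectro-temporal
    context against the reference channel's center bin, apply PHAT-style
    magnitude normalization, and flatten real/imag parts into a vector
    of length 2*M*(2L+1)*(2I+1).
    """
    M, T, F = X.shape
    Xp = np.pad(X, ((0, 0), (L, L), (I, I)))  # zero-pad time and frequency
    feats = np.empty((T, F, 2 * M * (2 * L + 1) * (2 * I + 1)))
    for t in range(T):
        for f in range(F):
            ref = Xp[0, t + L, f + I]                      # reference channel, center bin
            ctx = Xp[:, t:t + 2 * L + 1, f:f + 2 * I + 1]  # (M, 2L+1, 2I+1) context
            corr = ctx * np.conj(ref)                      # cross-correlation per tap
            corr = corr / (np.abs(corr) + eps)             # PHAT-style normalization
            feats[t, f] = np.concatenate([corr.real.ravel(), corr.imag.ravel()])
    return feats
```

With SCOT normalization one would instead divide by the geometric mean of the two auto-spectra; the sketch uses PHAT only because its normalizer depends on a single magnitude. Either way, each TF bin yields a bounded real vector suitable as input to the TF-Encoder.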