Context-Aware Two-Step Training Scheme for Domain Invariant Speech Separation

Wupeng Wang, Zexu Pan, Jingru Lin, Shuai Wang, Haizhou Li

TL;DR

The paper tackles domain mismatch in speech separation by proposing a context-aware two-stage training scheme that mirrors auditory processing. It decomposes a separation model into a SIMO context extractor and a SISO segregator. In the first stage, the extractor is guided by InfoNCE-based contextual losses whose targets (Mel, phoneme, and word representations) are derived from SSL models; in the second stage, the full model is trained with SI-SDR to reconstruct the target speech. Across synthetic and real datasets, the approach improves cross-domain performance and reduces WER, with word-level contextual representations providing the strongest gains; HuBERT-based group-stage setups yield the best results. The work demonstrates robust domain transfer without explicit domain adaptation and points to SSL-informed contextual targets as a practical route to domain-invariant speech separation.
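
To make the first-stage objective more concrete, here is a minimal sketch of a frame-level InfoNCE loss between predicted and ground-truth contextual representations. It is illustrative only: the function names, the cosine similarity, the temperature value, and the choice of the other frames of the same utterance as negatives are all assumptions, not details confirmed by this summary.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def info_nce(predicted, targets, temperature=0.1):
    """Mean InfoNCE loss over frames.

    predicted[t] is the estimated contextual representation for frame t.
    targets[t] is its positive; every other target frame acts as a negative
    (an assumed negative-sampling scheme, chosen for simplicity).
    """
    total = 0.0
    for t, pred in enumerate(predicted):
        logits = [cosine(pred, tgt) / temperature for tgt in targets]
        # Numerically stable log-sum-exp over positive and negatives.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # -log(exp(positive) / sum(exp(all))) = log_denominator - positive
        total += log_denominator - logits[t]
    return total / len(predicted)


if __name__ == "__main__":
    targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    aligned = targets
    misaligned = [targets[1], targets[2], targets[0]]
    print("aligned loss:", info_nce(aligned, targets))
    print("misaligned loss:", info_nce(misaligned, targets))
```

The loss is small when each predicted frame is closest to its own target and large when it is closer to another frame's target, which is the behaviour a contrastive contextual loss is meant to reward.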

Abstract

Speech separation seeks to isolate individual speech signals from a multi-talker speech mixture. Despite much progress, a system well-trained on synthetic data often experiences performance degradation on out-of-domain data, such as real-world speech mixtures. To address this, we introduce a novel context-aware, two-stage training scheme for speech separation models. In this training scheme, the conventional end-to-end architecture is replaced with a framework that contains a context extractor and a segregator. The two modules are trained step by step to simulate the speech separation process of an auditory system. We evaluate the proposed training scheme through cross-domain experiments on both synthetic and real-world speech mixtures, and demonstrate that our new scheme effectively boosts separation quality across different domains without adaptation, as measured by signal quality metrics and word error rate (WER). Additionally, an ablation study on the real test set highlights that the context information, including phoneme and word representations from pretrained SSL models, serves as an effective domain-invariant training target for separation models.
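
The second training stage and the signal-quality evaluation both rely on SI-SDR. As a reference point, the sketch below computes the standard scale-invariant signal-to-distortion ratio for one estimated signal and one reference signal. It uses plain Python lists to stay dependency-free; the function name, the mean removal, and the `eps` constant are illustrative choices, and the paper's exact training loss (for example, how speakers are assigned to outputs) is not specified in this summary.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length signals."""
    n = len(reference)
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]

    # Project the estimate onto the reference to get the scaled target.
    scale = sum(e * r for e, r in zip(est, ref)) / (sum(r * r for r in ref) + eps)
    target = [scale * r for r in ref]
    # Everything in the estimate that is not explained by the target.
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(d * d for d in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


if __name__ == "__main__":
    reference = [math.sin(0.1 * i) for i in range(200)]
    noisy = [r + 0.1 * math.cos(0.37 * i) for i, r in enumerate(reference)]
    print("SI-SDR (dB):", si_sdr(noisy, reference))
```

When SI-SDR is used as a training objective, it is typically negated so that minimizing the loss maximizes the ratio. Rescaling the estimate leaves the value essentially unchanged, which is the "scale-invariant" part of the name.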

Paper Structure

This paper contains 10 sections, 1 equation, 2 figures, 2 tables.

Figures (2)

  • Figure 1: (A) illustrates a general SIMO-SISO framework for any speech separation model. (B) and (C) are schematic diagrams of the two training stages. $\boldsymbol{x}$ is the mixture waveform. $\mathbf{E}_m$ is the latent embedding extracted by the encoder. $\mathbf{C}_1$ and $\mathbf{C}_2$ are the predicted contextual embeddings. $\tilde{\mathbf{U}}_1$ and $\tilde{\mathbf{U}}_2$ are the estimated contextual representations, and $\mathbf{U}_1$ and $\mathbf{U}_2$ are the ground-truth contextual representations. $\mathbf{M}_1$ and $\mathbf{M}_2$ are the latent masks from the segregator $h_{\theta}$. $\hat{\boldsymbol{s}}_1$ and $\hat{\boldsymbol{s}}_2$ are the estimated signals, and $\boldsymbol{s}_1$ and $\boldsymbol{s}_2$ are the target reference speech.
  • Figure 2: Automatic speech recognition results, in terms of word error rate (WER, %), for a ConvTasNet separator trained on LM2Mix with various supervisory targets and tested on the REAL-M test set.
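
For readers unfamiliar with the metric reported in Figure 2, the following sketch shows how WER is commonly computed: the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; the paper's ASR system and text normalization are not described in this summary.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In the example call, one substitution ("sat" to "sit") and one deletion ("the") over six reference words give a WER of about 33.3%.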