Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction

Zhaoxi Mu, Xinyu Yang, Sining Sun, Qing Yang

TL;DR

This work tackles speaker confusion in target speech extraction (TSE) by proposing SDR-TSE, a two-phase, self-supervised disentangled representation learning framework. A reference speech encoding network (RSEN) first decomposes the reference speech into semantic and global components, and a global information disentanglement network (GIDN) then refines the global information into a speaker embedding free of identity-irrelevant factors. An adaptive modulation Transformer (AM-Transformer) conditions the speech extraction network (SEN) on this disentangled speaker cue without disturbing the acoustic representation of the mixed signal. Key contributions include a self-supervised RSEN trained with variational objectives and mutual-information regularization, a channel-attention-based GIDN with a contrastive SIM loss, and an AMLN-based fusion mechanism that preserves content while enhancing speaker cue perception. Experiments on WSJ0-2mix and WSJ0-2mix-extr report state-of-the-art SI-SNRi/SDRi/PESQ scores and a substantial reduction in the chunk-level speaker-confusion metric r_scr, all without requiring speaker labels.

Abstract

Speech signals are inherently complex as they encompass both global acoustic characteristics and local semantic information. However, in the task of target speech extraction, certain elements of global and local semantic information in the reference speech, which are irrelevant to speaker identity, can lead to speaker confusion within the speech extraction network. To overcome this challenge, we propose a self-supervised disentangled representation learning method. Our approach tackles this issue through a two-phase process, utilizing a reference speech encoding network and a global information disentanglement network to gradually disentangle the speaker identity information from other irrelevant factors. We exclusively employ the disentangled speaker identity information to guide the speech extraction network. Moreover, we introduce the adaptive modulation Transformer to ensure that the acoustic representation of the mixed signal remains undisturbed by the speaker embeddings. This component incorporates speaker embeddings as conditional information, facilitating natural and efficient guidance for the speech extraction network. Experimental results substantiate the effectiveness of our meticulously crafted approach, showcasing a substantial reduction in the likelihood of speaker confusion.
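
To make the idea of using a speaker embedding "as conditional information" concrete, the following is a minimal sketch of one common conditioning scheme, conditional layer normalization. It is not the paper's AM-Transformer: the `(1 + scale)` form, the linear projections, and every function and parameter name here are assumptions chosen for illustration. The point it shows is that the speaker embedding only sets a per-channel scale and shift, and is never added to or concatenated with the mixture features.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(vec, weight, bias):
    """Dense layer; `weight` is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def conditional_layer_norm(frames, speaker_emb, w_scale, b_scale, w_shift, b_shift):
    """Normalize each mixture frame, then modulate it with speaker-derived parameters.

    Hypothetical sketch: the scale and shift are linear projections of the
    speaker embedding, applied identically to every frame.
    """
    scale = linear(speaker_emb, w_scale, b_scale)  # one value per channel
    shift = linear(speaker_emb, w_shift, b_shift)  # one value per channel
    out = []
    for frame in frames:
        normed = layer_norm(frame)
        out.append([(1.0 + s) * n + t for n, s, t in zip(normed, scale, shift)])
    return out

# Toy usage: two frames with four channels, and a two-dimensional speaker embedding.
frames = [[0.5, -1.0, 2.0, 0.0], [1.0, 1.5, -0.5, 0.25]]
speaker_emb = [0.2, -0.1]
w = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.5], [-0.4, 0.1]]
b = [0.0, 0.0, 0.0, 0.0]
modulated = conditional_layer_norm(frames, speaker_emb, w, b, w, b)
```

With all projection weights and biases set to zero, the function reduces to plain layer normalization, so in this sketch the conditioning acts as a learned perturbation around the unconditioned features.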

Paper Structure

This paper contains 15 sections, 24 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Schematic depicting information disentanglement of reference speech.
  • Figure 2: The architecture of SDR-TSE. Panels (a), (b), and (c) depict the speech extraction network, the global information disentanglement network, and the reference speech encoding network, respectively. The semantic information encoder $E_c$ and spectrogram decoder $D$ within the dashed box are used only during training to facilitate disentanglement and are discarded during inference. MIM refers to mutual information minimization. The red channels in the feature map of the GIDN indicate activated channels containing speaker identity information, while blue channels represent suppressed channels containing harmful information.
  • Figure 3: 2-D visualization of the spatial distribution of $z_c$, $z_g$, and $z_s$ for the reference speech of five different speakers on the WSJ0-2mix-extr test set. Speakers are labelled as M (male) and F (female).