
Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling

Injune Hwang, Kyogu Lee

TL;DR

This work tackles speaker information leakage in self-supervised speech representations by introducing a boundary-guided, variable-length soft pooling mechanism that yields event-based, phoneme-aligned representations. A boundary predictor and a dual training objective, Contrastive Predictive Coding ($L_{CPC}$) plus a boundary-aware contrastive loss ($L_{contr}$), drive the model to preserve linguistic content while suppressing speaker cues. Time-stretch and pitch-shift augmentations provide positive samples whose boundaries are altered. The Soft Pooling Module applies Gaussian attention over the predicted boundaries to produce downsampled representations, and the combined losses encourage content to align across augmented views. Experiments with LibriSpeech, the phonetic ABX task, the speaker identification (SID) task, and TIMIT show improved phonetic content retention and reduced speaker information; the predicted boundaries closely match phoneme boundaries, with a segmentation F1 of about 74.15%.

Abstract

Recently, there have been efforts to encode the linguistic information of speech using a self-supervised framework for speech synthesis. However, predicting representations from surrounding representations can inadvertently entangle speaker information in the speech representation. This paper aims to remove speaker information by exploiting the structured nature of speech, which is composed of discrete units such as phonemes with clear boundaries. A neural network predicts these boundaries, enabling variable-length pooling that extracts event-based representations instead of fixed-rate ones. The boundary predictor outputs a boundary probability between 0 and 1, which makes the pooling soft. The model is trained to minimize the difference between the pooled representation of the original data and that of data augmented by time-stretch and pitch-shift. To confirm that the learned representation retains content information but is independent of speaker information, the model was evaluated on Libri-light's phonetic ABX task and SUPERB's speaker identification task.
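
The idea of soft, variable-length pooling can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes that a frame's soft segment index is the running sum of boundary probabilities, and that each segment attends to frames through a Gaussian centered on its integer index. The function name `soft_pool` and the parameters `sigma` and `eps` are hypothetical.

```python
import numpy as np


def soft_pool(frames, boundary_prob, sigma=0.5, eps=1e-8):
    """Pool T frame vectors into K segment vectors using soft boundaries.

    frames:        (T, D) frame-level representations
    boundary_prob: (T,) boundary probabilities in [0, 1]
    sigma:         Gaussian width (hypothetical hyperparameter)
    """
    frames = np.asarray(frames, dtype=float)
    boundary_prob = np.asarray(boundary_prob, dtype=float)

    # Assumption: the soft segment index of a frame is the number of
    # boundaries accumulated up to and including that frame.
    position = np.cumsum(boundary_prob)                      # (T,)
    num_segments = int(round(position[-1])) + 1              # K

    # Gaussian attention of each segment k over all frames t.
    centers = np.arange(num_segments)[:, None]               # (K, 1)
    weights = np.exp(-((position[None, :] - centers) ** 2) / (2.0 * sigma ** 2))

    # Normalize so that each segment's attention sums to (almost exactly) 1.
    weights = weights / (weights.sum(axis=1, keepdims=True) + eps)

    # Weighted average of frames per segment: (K, T) @ (T, D) -> (K, D).
    return weights @ frames, weights


# Six frames with one confident boundary at frame 3 give two pooled vectors.
frames = np.arange(12.0).reshape(6, 2)
pooled, weights = soft_pool(frames, [0, 0, 0, 1, 0, 0], sigma=0.1)
print(pooled.shape)
```

With hard 0/1 boundaries and a small `sigma`, this reduces to near mean pooling within each segment. With intermediate probabilities, frames near an uncertain boundary contribute to both neighboring segments, which is what makes the pooling soft and keeps it differentiable with respect to the boundary probabilities.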

Paper Structure (12 sections, 3 equations, 4 figures, 2 tables)


Figures (4)

  • Figure 1: The concept of soft pooling.
  • Figure 2: Overall architecture of the model. $g_{enc}$ and $g_{ar}$ are the feature extractor and the autoregressive network for extracting context vectors, respectively. The contrastive loss is calculated from the pooled representations of the original data and the augmented data. The blue solid line, the blue dotted line, and the red dotted line represent the anchor, the positive sample, and the negative sample, respectively.
  • Figure 3: Description of soft pooling module.
  • Figure 4: The top three panels show the mel-spectrogram with phoneme boundaries, the predicted boundaries, and the unnormalized attention weights for the original data, respectively; the bottom three panels show the same components for the augmented data. The sample is SI943 of speaker FAKS0 from TIMIT.
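
The anchor, positive, and negative roles described in the Figure 2 caption can be sketched as an InfoNCE-style loss over pooled representations. This is an illustration under stated assumptions rather than the paper's exact $L_{contr}$: it assumes both views yield the same number of pooled segments, that the positive for segment k is segment k of the augmented view, that the other segments of that view serve as negatives, and that similarity is cosine similarity scaled by a hypothetical `temperature`.

```python
import numpy as np


def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE-style loss between two sets of pooled segment vectors.

    anchor:   (K, D) pooled representations of the original data
    positive: (K, D) pooled representations of the augmented data
    Row k of `positive` is the positive for row k of `anchor`; all other
    rows act as negatives (an assumption of this sketch).
    """
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)

    # Cosine similarity of every anchor with every candidate, (K, K).
    logits = (a @ p.T) / temperature

    # Log-softmax over candidates, shifted by the row maximum for stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Negative log-probability assigned to the matching segment.
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
original = rng.normal(size=(5, 8))
augmented = original + 0.01 * rng.normal(size=(5, 8))   # content preserved
misaligned = np.roll(augmented, 1, axis=0)               # segments shifted by one

print(info_nce(original, augmented) < info_nce(original, misaligned))
```

The loss is low when each pooled segment of the original data is most similar to the same segment of the augmented data. Because time-stretch moves the boundaries and pitch-shift changes speaker-related cues, minimizing such a loss rewards representations that track content rather than timing or voice.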