Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling
Injune Hwang, Kyogu Lee
TL;DR
The work tackles speaker information leakage in self-supervised speech representations by introducing a boundary-guided, variable-length soft pooling mechanism that yields event-based, phoneme-aligned representations. A boundary predictor is trained with two objectives, Contrastive Predictive Coding ($L_{CPC}$) and a boundary-aware contrastive loss ($L_{contr}$), which together drive the model to preserve linguistic content while suppressing speaker cues. Time-stretch and pitch-shift augmentations provide positive samples with altered boundaries. The Soft Pooling Module applies a Gaussian-attention mechanism over the predicted boundaries to produce downsampled representations, and the combined losses encourage the content of augmented views to align. Experiments using LibriSpeech, the phonetic ABX task, speaker identification (SID), and TIMIT demonstrate improved phonetic content retention and reduced speaker information. Predicted boundaries closely match phoneme boundaries, with a phoneme-boundary segmentation F1 of around 74.15%.
Abstract
Recently, there have been efforts to encode the linguistic information of speech using a self-supervised framework for speech synthesis. However, predicting representations from surrounding representations can inadvertently entangle speaker information in the speech representation. This paper aims to remove speaker information by exploiting the structured nature of speech, which is composed of discrete units such as phonemes with clear boundaries. A neural network predicts these boundaries, enabling variable-length pooling that extracts event-based representations instead of fixed-rate ones. The boundary predictor outputs a boundary probability between 0 and 1, which makes the pooling soft. The model is trained to minimize the difference between the pooled representation of the original data and that of the same data augmented by time-stretch and pitch-shift. To confirm that the learned representation contains content information but is independent of speaker information, the model was evaluated on Libri-light's phonetic ABX task and SUPERB's speaker identification task.
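To make the idea of boundary-driven soft pooling concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `soft_pool`, the use of a running sum of boundary probabilities as a soft segment index, the rule for choosing the number of segments, and the `sigma` parameter are all assumptions made for illustration. The sketch only shows how per-frame boundary probabilities can turn a fixed-rate sequence into a shorter, variable-length one through Gaussian weights.

```python
import math


def soft_pool(frames, boundary_probs, sigma=0.1):
    """Pool T frame vectors into K segment vectors with Gaussian weights.

    Illustrative assumptions (not taken from the paper):
      * a boundary probability near 1 at frame t means a new segment starts at t;
      * the running sum of probabilities acts as a soft segment index per frame;
      * segment k attends to frames whose soft index is close to k.
    """
    if len(frames) != len(boundary_probs):
        raise ValueError("frames and boundary_probs must have the same length")
    if not frames:
        return []

    # Soft, monotonically non-decreasing segment index for every frame.
    positions = []
    total = 0.0
    for p in boundary_probs:
        total += p
        positions.append(total)

    # Hypothetical choice: one more segment than the expected boundary count.
    num_segments = int(round(total)) + 1
    dim = len(frames[0])

    pooled = []
    for k in range(num_segments):
        # Unnormalized Gaussian attention of segment k over all frames.
        weights = [math.exp(-((pos - k) ** 2) / (2.0 * sigma ** 2))
                   for pos in positions]
        norm = sum(weights)
        if norm == 0.0:
            # All weights underflowed; emit a zero vector for this segment.
            pooled.append([0.0] * dim)
            continue
        # Weighted average of the frame vectors.
        pooled.append([
            sum(w * frame[d] for w, frame in zip(weights, frames)) / norm
            for d in range(dim)
        ])
    return pooled


# Toy input: 7 one-dimensional frames with boundaries at frames 2 and 5.
frames = [[1.0], [3.0], [10.0], [20.0], [30.0], [5.0], [7.0]]
probs = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
segments = soft_pool(frames, probs)

# A crude "time-stretched" copy: every frame is repeated twice.
stretched_frames = [f for f in frames for _ in range(2)]
stretched_probs = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0,
                   0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
stretched_segments = soft_pool(stretched_frames, stretched_probs)
```

With hard 0/1 probabilities and a small `sigma`, each pooled vector is close to the mean of the frames inside its segment, so the original and the stretched toy inputs pool to nearly the same three vectors even though their lengths differ. That is the property the augmentation-based training objective relies on. With intermediate probabilities the weights spread across neighbouring frames, and this is what makes the pooling "soft".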
