Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation

Giuseppe Ruggiero, Matteo Testa, Jurgen Van de Walle, Luigi Di Caro

TL;DR

Eta-WavLM addresses the challenge of removing speaker identity from SSL speech representations by introducing a simple, offline linear disentanglement that decomposes SSL features into a speaker-dependent part and a speaker-independent eta component. The method learns a latent basis $A^*$ and bias $b^*$ from a large multi-speaker corpus and produces eta representations $\bm{\eta}$ through a straightforward linear inverse, avoiding retraining or quantization. Experiments show the eta representations measurably reduce speaker information in a speaker classification task and improve performance in a voice-conversion pipeline, outperforming several baselines across two target speakers. This approach offers an efficient, scalable way to obtain content-focused SSL features with potential applicability to multilingual settings and other downstream tasks such as ASR and expressive synthesis.
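
The linear decomposition described above can be illustrated with a minimal sketch. The summary does not state the equation itself, so the form used here is an assumption: each SSL frame $s$ is modeled as $s = A^\top d + b + \bm{\eta}$, where $d$ is a speaker embedding for the utterance, $A^*$ and $b^*$ are fit by ordinary least squares, and $\bm{\eta}$ is the residual. The function names, array sizes, and synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_speaker_map(S, D):
    """Least-squares fit of S ~= D @ A + b (assumed model, see lead-in).

    S: (N, Q) frame-level SSL features.
    D: (N, P) speaker embedding of the utterance each frame belongs to.
    """
    # Append a column of ones so the bias is estimated jointly with A.
    D_aug = np.hstack([D, np.ones((D.shape[0], 1))])
    W, *_ = np.linalg.lstsq(D_aug, S, rcond=None)
    return W[:-1], W[-1]  # A: (P, Q), b: (Q,)

def eta_features(S, D, A, b):
    """Subtract the part of S that is linearly predictable from D."""
    return S - D @ A - b

# Synthetic stand-in data (hypothetical sizes, not from the paper).
rng = np.random.default_rng(0)
n_speakers, P, Q, frames_per_speaker = 5, 4, 8, 40
speaker_emb = rng.normal(size=(n_speakers, P))
D = np.repeat(speaker_emb, frames_per_speaker, axis=0)  # one row per frame
content = rng.normal(size=(D.shape[0], Q))
S = D @ rng.normal(size=(P, Q)) + rng.normal(size=Q) + content

A_star, b_star = fit_speaker_map(S, D)
eta = eta_features(S, D, A_star, b_star)
```

In this sketch the fit is done once, offline, and applying it to new features is a single matrix product and subtraction, with no retraining of the SSL model. By construction the residual is uncorrelated with the speaker embedding on the fitting data; it removes only the speaker information that is linearly predictable from $d$.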

Abstract

Self-supervised learning (SSL) has reduced the reliance on expensive labeling in speech technologies by learning meaningful representations from unannotated data. Since most SSL-based downstream tasks prioritize content information in speech, ideal representations should disentangle content from unwanted variations such as speaker characteristics. However, removing speaker information often degrades other speech components, and existing methods either fail to fully disentangle speaker identity or require resource-intensive models. In this paper, we propose a novel disentanglement method that linearly decomposes SSL representations into speaker-specific and speaker-independent components, effectively generating speaker-disentangled representations. Comprehensive experiments show that our approach achieves speaker independence and, as such, when applied to content-driven tasks such as voice conversion, our representations yield significant improvements over state-of-the-art methods.

Paper Structure

This paper contains 17 sections, 11 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: High-level overview of the proposed approach.
  • Figure 2: UMAP projections of the WavLM (a) and Eta-WavLM (b) representations extracted from 10 utterances of 5 speakers (with ids 1995, 2830, 4992, 61, 6829) from the LibriSpeech test-clean set.
  • Figure 3: PaCMAP projections of the WavLM (a) and Eta-WavLM (b) representations extracted from 10 utterances of 5 speakers (with ids 1995, 2830, 4992, 61, 6829) from the LibriSpeech test-clean set.