Speaker Disentanglement of Speech Pre-trained Model Based on Interpretability

Xiaoxu Zhu, Junhua Li, Aaron J. Li, Yiming Ren, Baoxiang Li

TL;DR

The paper addresses content–speaker entanglement in self-supervised speech representations, which biases content tasks such as ASR and raises privacy concerns. It proposes a SHAP-based direct benchmark (InterpTRQE-SptME) that quantifies residual speaker information in content embeddings, and a post-hoc, interpretability-based filter (InterpTF-SptME) with two variants, SHAP Noise and SHAP Cropping, that removes speaker cues without retraining. Across seven models on the VCTK dataset, residual speaker information ranges from about 5% to nearly 19%, with ContentVec showing the best disentanglement. With a suitably chosen noise scale, SHAP Noise reduces residuals to near zero while increasing CTC loss by under 1%. The approach is model-agnostic, and the analysis also shows how speaker information is distributed across layers.

Abstract

Self-supervised speech models learn representations that capture both content and speaker information. Yet this entanglement creates problems: content tasks suffer from speaker bias, and privacy concerns arise when speaker identity leaks through supposedly anonymized representations. We present two contributions to address these challenges. First, we develop InterpTRQE-SptME (Timbre Residual Quantitative Evaluation Benchmark of Speech pre-training Models Encoding via Interpretability), a benchmark that directly measures residual speaker information in content embeddings using SHAP-based interpretability analysis. Unlike existing indirect metrics, our approach quantifies the exact proportion of speaker information remaining after disentanglement. Second, we propose InterpTF-SptME, which uses these interpretability insights to filter speaker information from embeddings. Testing on VCTK with seven models including HuBERT, WavLM, and ContentVec, we find that SHAP Noise filtering reduces speaker residuals from 18.05% to nearly zero while maintaining recognition accuracy (CTC loss increase under 1%). The method is model-agnostic and requires no retraining.
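
To make the idea of a "proportion of speaker information remaining" concrete, the following is a minimal sketch, not the paper's definition. It assumes a speaker classifier whose input concatenates content-embedding features with reference speaker features, and that SHAP values for that classifier are already available as an array. The function name `speaker_residual_share` and the toy numbers are hypothetical.

```python
import numpy as np

def speaker_residual_share(shap_values, content_idx):
    """Share of total |SHAP| attribution that falls on content-embedding features.

    Assumption: shap_values has shape (n_samples, n_features) and explains a
    speaker classifier; content_idx lists the columns fed by the content embedding.
    """
    abs_attr = np.abs(np.asarray(shap_values, dtype=float))
    total = abs_attr.sum()
    if total == 0.0:
        return 0.0  # no attribution at all, so no measurable residual
    return float(abs_attr[:, content_idx].sum() / total)

# Toy attributions: 2 samples, 4 features; columns 0-1 are "content", 2-3 are "speaker".
toy = np.array([[0.1, -0.1, 0.4, 0.4],
                [0.0, 0.2, -0.3, 0.5]])
share = speaker_residual_share(toy, content_idx=[0, 1])
print(f"residual share: {share:.2%}")
```

Under this reading, a lower share means the speaker classifier relies less on the content embedding, which is the direction in which the abstract reports ContentVec performing best.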

Paper Structure

This paper contains 12 sections, 8 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of our framework combining InterpTRQE-SptME benchmark (left) for quantifying speaker residuals and InterpTF-SptME filtering methods (right) for removing speaker information from content embeddings.
  • Figure 2: SHAP value distribution comparison showing speaker (orange) vs content (blue) contributions. ContentVec shows significantly reduced speaker contribution.
  • Figure 3: Trade-off between speaker residual reduction (Mean Score) and content preservation (CTC Loss) for SHAP Noise filtering. $\sigma$ controls the noise scale (negative values indicate the proportion of SHAP-weighted noise added).
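
The SHAP Noise idea behind Figure 3 can be illustrated with a short sketch. This is an assumption-laden illustration, not the authors' implementation: it assumes one SHAP importance score per embedding dimension, scales Gaussian noise by that importance, and uses only the magnitude of $\sigma$, since this section does not spell out the sign convention. The name `shap_noise_filter` and its parameters are hypothetical.

```python
import numpy as np

def shap_noise_filter(embeddings, shap_importance, sigma, seed=0):
    """Add Gaussian noise to each embedding dimension in proportion to its SHAP importance.

    Assumptions: embeddings has shape (n_frames, n_dims); shap_importance has
    shape (n_dims,) and scores how much each dimension helps a speaker classifier.
    """
    rng = np.random.default_rng(seed)
    emb = np.asarray(embeddings, dtype=float)
    weights = np.abs(np.asarray(shap_importance, dtype=float))
    peak = weights.max()
    if peak > 0:
        weights = weights / peak  # normalise so the most speaker-laden dimension has weight 1
    # Scale noise to each dimension's own spread so sigma acts as a relative noise level.
    noise = rng.standard_normal(emb.shape) * emb.std(axis=0, keepdims=True)
    return emb + abs(sigma) * weights * noise

rng = np.random.default_rng(1)
emb = rng.normal(size=(5, 3))
importance = np.array([0.0, 0.5, 1.0])  # dimension 0 carries no speaker attribution
filtered = shap_noise_filter(emb, importance, sigma=0.1)
```

Dimensions with zero attribution pass through unchanged, so content-bearing features are perturbed only as far as they also carry speaker cues. Raising the magnitude of `sigma` perturbs more, which mirrors the trade-off between residual reduction and CTC loss that the figure plots.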