Private kNN-VC: Interpretable Anonymization of Converted Speech

Carlos Franzreb, Arnab Das, Tim Polzehl, Sebastian Möller

TL;DR

This work investigates which facets of speech leak speaker identity after anonymization. It proposes an interpretable extension of kNN-VC that adds a phone predictor, a duration predictor, and a stage that quantizes phonetic variation. The results show that both phone duration and phonetic variation encode speaker identity, and that tighter control over phonetic variation yields stronger privacy at a modest cost in utility. Evaluated within the VoicePrivacy Challenge (VPC) 2024 framework, constraining phonetic variation pushes the attacker's equal error rate (EER) close to 50%, which is chance level, while maintaining reasonable intelligibility and emotion preservation. Privacy also proves surprisingly sensitive to the target selection strategy, which can dramatically alter the outcome of the attack. These findings inform practical design choices for privacy-preserving speech systems and point to open work on emotion fidelity and interpretability in anonymization schemes.
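
To make the EER figure above concrete, the following is a minimal sketch of how an equal error rate can be computed from speaker-verification scores. It is not the VPC 2024 evaluation code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy score lists are assumptions made for illustration only.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from similarity scores.

    genuine:  scores of trials where both utterances share a speaker
    impostor: scores of trials with two different speakers
    (Hypothetical helper for illustration, not the VPC 2024 tooling.)
    """
    best_gap, eer = None, None
    # Sweep every observed score as a candidate decision threshold.
    for threshold in sorted(set(genuine) | set(impostor)):
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        # False acceptance rate: impostor trials scored at or above it.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up): a successful attack separates the two trial types.
leaky = equal_error_rate([0.9, 0.8], [0.1, 0.2])
# If anonymization works, both trial types look alike to the attacker.
private = equal_error_rate([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4])
print(leaky, private)
```

With the toy scores, the perfectly separated case gives an EER of 0 and the indistinguishable case gives 0.5, which is why an EER near 50% is read as strong privacy.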

Abstract

Speaker anonymization seeks to conceal a speaker's identity while preserving the utility of their speech. The achieved privacy is commonly evaluated with a speaker recognition model trained on anonymized speech. Although this represents a strong attack, it is unclear which aspects of speech are exploited to identify the speakers. Our research sets out to unveil these aspects. It starts with kNN-VC, a powerful voice conversion model that performs poorly as an anonymization system, presumably because of prosody leakage. To test this hypothesis, we extend kNN-VC with two interpretable components that anonymize the duration and variation of phones. These components increase privacy significantly, proving that the studied prosodic factors encode speaker identity and are exploited by the privacy attack. Additionally, we show that changes in the target selection algorithm considerably influence the outcome of the privacy attack.
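
For readers unfamiliar with the baseline, the sketch below illustrates the core matching step of kNN-VC as we understand it from the original kNN-VC work: each source feature frame is replaced by the average of its k nearest target-speaker frames under cosine similarity. Feature extraction and vocoding are omitted, the frames are toy 2-dimensional lists rather than real self-supervised features, and the names `cosine_similarity` and `knn_convert` are hypothetical. The components added by Private kNN-VC are not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_convert(source_frames, target_frames, k=4):
    """Replace each source frame with the mean of its k nearest target frames.

    Illustrative sketch only; assumes all frames are non-zero and share
    the same dimensionality.
    """
    converted = []
    for frame in source_frames:
        # Rank the target frames from most to least similar to this frame.
        ranked = sorted(
            target_frames,
            key=lambda t: cosine_similarity(frame, t),
            reverse=True,
        )
        neighbours = ranked[:k]
        # Average the selected neighbours dimension by dimension.
        converted.append(
            [sum(n[d] for n in neighbours) / len(neighbours)
             for d in range(len(frame))]
        )
    return converted


# Toy example: one source frame S and three target frames T.
S = [[1.0, 0.0]]
T = [[2.0, 0.0], [0.0, 3.0], [4.0, 0.0]]
A = knn_convert(S, T, k=2)
print(A)
```

In the toy example the two target frames pointing in the same direction as the source frame are selected and averaged, so the output consists only of target-speaker material. The timing of the source frames is carried over unchanged, which is one way to picture the duration leakage that the abstract hypothesizes.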

Paper Structure

This paper contains 16 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Private kNN-VC: purple components are the same as those of kNN-VC. $S$ is the source speech, $T$ is the target speaker's speech, and $A$ is the anonymized speech.
  • Figure 2: Privacy vs. intelligibility
  • Figure 3: Privacy vs. emotion preservation
  • Figure 4: No. of targets vs. privacy