HLTCOE JHU Submission to the Voice Privacy Challenge 2024

Henry Li Xinyuan; Zexin Cai; Ashi Garg; Kevin Duh; Leibny Paola García-Perera; Sanjeev Khudanpur; Nicholas Andrews; Matthew Wiesner

TL;DR

The paper investigates voice privacy by comparing two anonymization paradigms: voice conversion (kNN-VC and WavLM-based) and cascaded ASR–TTS (Whisper–VITS). It reveals that voice-conversion approaches better preserve emotion but struggle under strong attacker models, while cascaded TTS achieves stronger anonymization at the expense of emotional content. A randomized admixture of the two approaches is proposed to balance privacy and utility, achieving strong privacy (EER around 40–50%) with competitive emotion preservation (UAR around the mid-40s to mid-50s). The findings offer a practical path to tunable voice anonymization suitable for varied privacy and usability requirements, highlighting the trade-offs between identity concealment and para-linguistic fidelity. Future work should explore equilibria with adversaries, provide controllable generation in TTS, and enhance preservation of para-linguistic features.
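The two headline metrics can be illustrated with a short sketch. This is not the challenge's official scoring code: the function names, the threshold sweep, and the toy inputs are assumptions made for illustration. It approximates the equal error rate (EER) of a speaker verifier from same-speaker and different-speaker trial scores, and computes the unweighted average recall (UAR) of an emotion classifier as the mean of per-class recalls.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Hypothetical helper: a trial is accepted when its score is >= threshold.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False rejection rate: same-speaker trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: different-speaker trials scored at or above it.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))
```

Under this reading, a higher EER means the attacker's verifier is closer to chance, that is, stronger anonymization, while a higher UAR means emotional content is better preserved.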

Abstract

We present several systems for the Voice Privacy Challenge, including voice conversion based systems such as the kNN-VC method and the WavLM voice conversion method, and text-to-speech (TTS) based systems such as Whisper-VITS. We found that while voice conversion systems better preserve emotional content, they struggle to conceal speaker identity in semi-white-box attack scenarios; conversely, TTS methods perform better at anonymization and worse at emotion preservation. Finally, we propose a random admixture system that seeks to balance the strengths and weaknesses of the two categories of systems, achieving a strong EER of over 40% while maintaining UAR at a respectable 47%.
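One way to picture the random admixture is the minimal sketch below. It assumes the choice between systems is made independently for each utterance, which the text here does not confirm. The names `random_admixture`, `tts_anonymize`, `vc_anonymize`, and `tts_fraction` are hypothetical; the two callables stand in for the cascaded TTS and kNN-VC systems.

```python
import random
from typing import Callable, List, Sequence, TypeVar

Utterance = TypeVar("Utterance")


def random_admixture(
    utterances: Sequence[Utterance],
    tts_anonymize: Callable[[Utterance], Utterance],
    vc_anonymize: Callable[[Utterance], Utterance],
    tts_fraction: float,
    seed: int = 0,
) -> List[Utterance]:
    """Anonymize each utterance with one of two systems chosen at random.

    Assumption: the draw is per utterance, and `tts_fraction` is the
    probability of routing an utterance through the cascaded TTS system.
    """
    if not 0.0 <= tts_fraction <= 1.0:
        raise ValueError("tts_fraction must lie in [0, 1]")
    rng = random.Random(seed)  # seeded so that a given mix is reproducible
    anonymized = []
    for utterance in utterances:
        # rng.random() is in [0, 1), so 1.0 always picks TTS and 0.0 never does.
        use_tts = rng.random() < tts_fraction
        system = tts_anonymize if use_tts else vc_anonymize
        anonymized.append(system(utterance))
    return anonymized
```

Sweeping `tts_fraction` from 0 to 1 would move the output from pure voice conversion to pure cascaded TTS, which is the kind of privacy and utility trade-off the admixture is meant to expose.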

Paper Structure (18 sections, 4 figures, 1 table)

Introduction
Method
Our Baseline: kNN-VC
WavLM Conversion
Joint training with target reconstruction objective
Joint adversarial training
Discretized and aligned objective
Cascaded Anonymization
Random Admixture
Experiments and Results
Datasets
Evaluation Metrics
Privacy
Utility
Results
...and 3 more sections

Figures (4)

Figure 1: Schematic of our adapted kNN-VC system.
Figure 2: Schematic of our WavLM conversion system with k-means discrete loss. The training targets (on the left-hand side) are discretized using k-means clustering. The resulting token sequence is used as the gold target labels when computing the CTC loss.
Figure 3: Cascaded ASR-TTS Anonymization Process
Figure 4: Scatter plot of various results from the Random Admixture system. Results from the two source systems, cascaded TTS and kNN-VC, are included. Each point is labeled and color-coded with the percentage of the admixture which was drawn from the cascaded TTS system.
