HLTCOE JHU Submission to the Voice Privacy Challenge 2024
Henry Li Xinyuan, Zexin Cai, Ashi Garg, Kevin Duh, Leibny Paola García-Perera, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner
TL;DR
The paper investigates voice privacy by comparing two anonymization paradigms: voice conversion (kNN-VC and WavLM-based) and cascaded ASR–TTS (Whisper–VITS). It reveals that voice-conversion approaches better preserve emotion but struggle under strong attacker models, while cascaded TTS achieves stronger anonymization at the expense of emotional content. A randomized admixture of the two approaches is proposed to balance privacy and utility, achieving strong privacy (EER around 40–50%) with competitive emotion preservation (UAR around the mid-40s to mid-50s). The findings offer a practical path to tunable voice anonymization suitable for varied privacy and usability requirements, highlighting the trade-offs between identity concealment and para-linguistic fidelity. Future work should explore equilibria with adversaries, provide controllable generation in TTS, and enhance preservation of para-linguistic features.
Abstract
We present several systems for the Voice Privacy Challenge, including voice-conversion-based systems such as the kNN-VC method and the WavLM voice conversion method, and text-to-speech (TTS) based systems including Whisper-VITS. We found that while voice conversion systems better preserve emotional content, they struggle to conceal speaker identity in semi-white-box attack scenarios; conversely, TTS methods perform better at anonymization and worse at emotion preservation. Finally, we propose a random admixture system that seeks to balance the strengths and weaknesses of the two categories of systems, achieving a strong EER of over 40% while maintaining UAR at a respectable 47%.
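The random admixture idea can be illustrated with a minimal sketch. It assumes the mixing is done per utterance, with each utterance routed to the TTS-based system with some probability and to the voice conversion system otherwise; the abstract does not state the granularity or the mixing proportion, so both are assumptions here. The names `make_admixture`, `p_tts`, `vc_anonymize`, and `tts_anonymize` are hypothetical, and the two anonymizers are stand-in functions rather than real kNN-VC or Whisper-VITS calls.

```python
import random
from typing import Callable, Optional

# An anonymizer maps an utterance (here just an identifier string)
# to its anonymized counterpart.
Anonymizer = Callable[[str], str]


def make_admixture(
    vc: Anonymizer,
    tts: Anonymizer,
    p_tts: float,
    seed: Optional[int] = None,
) -> Anonymizer:
    """Return an anonymizer that picks TTS with probability p_tts, else VC.

    Assumption: the choice is made independently for each utterance.
    """
    if not 0.0 <= p_tts <= 1.0:
        raise ValueError("p_tts must lie in [0, 1]")
    rng = random.Random(seed)

    def anonymize(utterance: str) -> str:
        # rng.random() is in [0, 1), so p_tts = 0 never selects TTS
        # and p_tts = 1 always selects it.
        if rng.random() < p_tts:
            return tts(utterance)
        return vc(utterance)

    return anonymize


# Stand-in anonymizers (hypothetical placeholders for the real systems).
def vc_anonymize(utterance: str) -> str:
    return f"vc({utterance})"


def tts_anonymize(utterance: str) -> str:
    return f"tts({utterance})"


if __name__ == "__main__":
    mix = make_admixture(vc_anonymize, tts_anonymize, p_tts=0.5, seed=0)
    for utt in ["utt1", "utt2", "utt3", "utt4"]:
        print(mix(utt))
```

In this sketch, raising `p_tts` moves the mixture toward the TTS-based system (stronger anonymization, weaker emotion preservation), and lowering it moves toward voice conversion, which is the trade-off the admixture is meant to tune.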
