Speaker anonymization using neural audio codec language models
Michele Panariello, Francesco Nespoli, Massimiliano Todisco, Nicholas Evans
TL;DR
The paper addresses the leakage of speaker identity in traditional x-vector–based anonymization pipelines by introducing an approach built on neural audio codec (NAC) language models. A semantic encoder extracts content tokens and a hierarchical NAC (with coarse and fine codebooks) generates acoustic tokens, which are then swapped with tokens from a pool of pseudo-speakers to produce anonymized speech while preserving meaning. Using two transformers (a coarse autoregressive one and a fine non-autoregressive one), the system bottlenecks speaker information through discretized codes. This yields strong privacy (higher EER) at the cost of a somewhat higher word error rate (WER) in automatic transcription, while prosody is largely preserved ($\rho^{F0}\approx0.7$, $G_{VD}\approx-2$). Evaluated under the Voice Privacy Challenge 2022 framework on LibriSpeech and VCTK, the NAC-based method outperforms prior approaches in privacy while maintaining acceptable intelligibility, which points to NAC language modeling as a promising direction for privacy-preserving speech synthesis. Future work should focus on improving utility, for example with high-quality prompts or utility-preserving constraints, to strengthen the privacy–utility trade-off in practical deployments.
Abstract
The vast majority of approaches to speaker anonymization involve the extraction of fundamental frequency estimates, linguistic features and a speaker embedding which is perturbed to obfuscate the speaker identity before an anonymized speech waveform is resynthesized using a vocoder. Recent work has shown that x-vector transformations are difficult to control consistently: other sources of speaker information contained within fundamental frequency and linguistic features are re-entangled upon vocoding, meaning that anonymized speech signals still contain speaker information. We propose an approach based upon neural audio codecs (NACs), which are known to generate high-quality synthetic speech when combined with language models. NACs use quantized codes, which are known to effectively bottleneck speaker-related information: we demonstrate the potential of speaker anonymization systems based on NAC language modeling by applying the evaluation framework of the Voice Privacy Challenge 2022.
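As an illustration of the privacy metric referred to above, the following is a minimal sketch of how an equal error rate (EER) can be estimated from speaker-verification scores. It is not the Voice Privacy Challenge evaluation code: the function name `equal_error_rate`, the toy scores, and the simple threshold sweep are assumptions made for illustration. In the anonymization setting, a higher EER for the verifier indicates stronger privacy.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER from verification scores (hypothetical helper).

    target_scores: scores of same-speaker trials (higher = more similar).
    nontarget_scores: scores of different-speaker trials.
    Both lists are assumed to be non-empty.
    """
    # Candidate thresholds: every score value observed in either list.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: different-speaker trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # Keep the operating point where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy scores: the verifier separates speakers perfectly (no privacy).
separable = equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])

# Toy scores: the verifier cannot tell speakers apart (strong privacy).
confusable = equal_error_rate([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])
```

With these toy inputs, `separable` is at the low end of the range and `confusable` sits near 0.5, the value expected when verification scores carry no speaker information.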
