Speaker anonymization using neural audio codec language models

Michele Panariello, Francesco Nespoli, Massimiliano Todisco, Nicholas Evans

TL;DR

The paper tackles the leakage of speaker identity in traditional x-vector–based anonymization pipelines by introducing a neural audio codec (NAC) language-model approach. It uses a semantic encoder to extract content tokens and a hierarchical NAC (with coarse and fine codebooks) to generate acoustic tokens, which are then swapped with tokens from a pool of pseudo-speakers to produce anonymized speech while preserving meaning. Through a dual-transformer setup (a coarse autoregressive model and a fine non-autoregressive model), the system bottlenecks speaker information via discretized codes. This yields strong privacy (a higher equal error rate, EER) at the cost of some degradation in automatic transcription (a higher word error rate, WER), while preserving prosody to a large extent ($\rho^{F0}\approx0.7$, $G_{VD}\approx-2$). Evaluated under the Voice Privacy Challenge 2022 on LibriSpeech and VCTK, the NAC-based method outperforms prior approaches in privacy while maintaining acceptable intelligibility, highlighting NAC language modeling as a promising direction for privacy-preserving speech synthesis. Future work should focus on enhancing utility with high-quality prompts or utility-preserving constraints to strengthen the privacy–utility trade-off in practical deployments.
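
The privacy claim above rests on the equal error rate (EER) of an automatic speaker verification system: the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of that computation, not the Voice Privacy Challenge evaluation code. It assumes similarity scores where a higher value means "same speaker"; the function name `equal_error_rate` and the toy scores are illustrative and do not come from the paper.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: higher score = verifier believes the two utterances
    come from the same speaker.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))

    # False rejection rate: same-speaker trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: different-speaker trials scored at or above it.
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    # The EER lies where the two error rates cross; take the closest point.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)

# Toy scores (made up): the verifier separates speakers perfectly.
print(equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))            # 0.0
# Toy scores (made up): the verifier cannot tell speakers apart.
print(equal_error_rate([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]))  # 0.5
```

An EER near 0 means the verifier still recognises the original speaker, while an EER approaching 0.5 means it performs at chance level, which is why a higher EER on anonymized speech indicates stronger privacy.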

Abstract

The vast majority of approaches to speaker anonymization involve the extraction of fundamental frequency estimates, linguistic features and a speaker embedding which is perturbed to obfuscate the speaker identity before an anonymized speech waveform is resynthesized using a vocoder. Recent work has shown that x-vector transformations are difficult to control consistently: other sources of speaker information contained within fundamental frequency and linguistic features are re-entangled upon vocoding, meaning that anonymized speech signals still contain speaker information. We propose an approach based upon neural audio codecs (NACs), which are known to generate high-quality synthetic speech when combined with language models. NACs use quantized codes, which are known to effectively bottleneck speaker-related information: we demonstrate the potential of speaker anonymization systems based on NAC language modeling by applying the evaluation framework of the Voice Privacy Challenge 2022.

Paper Structure (8 sections, 2 equations, 1 figure, 1 table)

Figures (1)

  • Figure 1: Diagram of the proposed anonymization system.