NTU-NPU System for Voice Privacy 2024 Challenge
Nikita Kuzmin, Hieu-Thi Luong, Jixun Yao, Lei Xie, Kong Aik Lee, Eng Siong Chng
TL;DR
The paper addresses preserving emotional content while anonymizing speaker identity in the Voice Privacy Challenge 2024. It extends two baselines, B3 and B5, with emotion embeddings, stronger speaker embedders (WavLM, ECAPA2), and a novel Mean Reversion F0 transform, and it also explores disentanglement-based approaches using β-VAE and NaturalSpeech3 (NS3) FACodec. The key findings are that NS3 with cross-gender conversion and additive white Gaussian noise (AWGN) can improve privacy without uniformly sacrificing utility, and that Mean Reversion F0 and emotion embeddings offer useful privacy-utility tradeoffs. Results are, however, volatile and configuration-dependent. Together, these findings inform practical voice anonymization that balances emotion preservation, intelligibility, and speaker anonymity under realistic threat models.
Abstract
In this work, we describe our submissions for the Voice Privacy Challenge 2024. Rather than proposing a novel speech anonymization system, we enhance the provided baselines to meet all required conditions and improve the evaluated metrics. Specifically, for the B3 baseline we add an emotion embedding and experiment with WavLM and ECAPA2 speaker embedders. We also compare different speaker and prosody anonymization techniques. For B5, we introduce Mean Reversion F0, which helps to enhance privacy without a loss in utility. Finally, we explore disentanglement models, namely $\beta$-VAE and NaturalSpeech3 FACodec.
