StreamVoiceAnon+: Emotion-Preserving Streaming Speaker Anonymization via Frame-Level Acoustic Distillation

Nikita Kuzmin; Kong Aik Lee; Eng Siong Chng

Abstract

We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant acoustic patterns rather than preserving paralinguistic attributes. We propose supervised finetuning with neutral-emotion utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to finetuning, which takes less than 2 hours on 4 GPUs and adds no inference latency overhead, while maintaining a competitive 180 ms streaming latency. On the VoicePrivacy 2024 protocol, our approach achieves 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), a 24% relative UAR improvement over the baseline (39.7% to 49.2%) and a 10% relative improvement over the emotion-prompt variant (44.6% UAR), while maintaining strong privacy (EER 49.0%). Demo and code are available at https://anonymous3842031239.github.io/
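The relative improvements quoted in the abstract can be checked directly from the reported UAR values. The short sketch below does only that arithmetic; the helper name `relative_gain` is illustrative and not taken from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# UAR values as reported in the abstract (percent).
baseline_uar = 39.7        # baseline
emotion_prompt_uar = 44.6  # emotion-prompt variant
proposed_uar = 49.2        # proposed method

print(round(relative_gain(baseline_uar, proposed_uar), 1))        # 23.9
print(round(relative_gain(emotion_prompt_uar, proposed_uar), 1))  # 10.3
```

Both figures round to the 24% and 10% stated in the abstract.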

Paper Structure (12 sections, 2 equations, 2 figures, 2 tables)

Introduction
Related Work
Proposed Method
  Problem Statement
  Supervised Finetuning with Neutral-Emotion Pairs
  Frame-Level Emotion Distillation
Experiments
  Setup
  Main Results
  Ablation Study
Conclusion
Generative AI Use Disclosure

Figures (2)

Figure 1: Privacy-emotion trade-off for streaming speaker anonymization methods. Our method (orange star) is compared to prior streaming methods (triangles).
Figure 2: Training and inference configurations. Training: prompt and source share the same speaker but differ in emotion, forcing the model to generate emotional output from source content rather than copying prompt-specific patterns. Frame-level emotion distillation ($\mathcal{L}_{\text{emo}}$, dashed) on Slow AR acoustic hidden states provides additional learning signal. Inference: a neutral utterance from the target anonymous speaker conceals source identity while the finetuned model preserves source emotion; no latency is added over the baseline. The transformer $f_\theta$ is depicted twice for visual clarity; weights are shared.
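The training configuration described in the Figure 2 caption can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in this excerpt: the metadata keys (`speaker`, `emotion`, `path`) are hypothetical, the prompt is taken to be the neutral utterance of each pair, and the form of $\mathcal{L}_{\text{emo}}$ is unknown here, so a frame-wise cosine distance between student hidden states and teacher emotion features is used purely as a stand-in.

```python
import math
from collections import defaultdict


def build_training_pairs(utterances):
    """Pair each neutral utterance (prompt) with each emotional utterance
    (source) of the SAME speaker.

    `utterances` is a list of dicts with hypothetical keys
    'speaker', 'emotion', and 'path'.
    """
    by_speaker = defaultdict(lambda: {"neutral": [], "emotional": []})
    for utt in utterances:
        bucket = "neutral" if utt["emotion"] == "neutral" else "emotional"
        by_speaker[utt["speaker"]][bucket].append(utt)

    pairs = []
    for groups in by_speaker.values():
        for prompt in groups["neutral"]:
            for source in groups["emotional"]:
                pairs.append((prompt["path"], source["path"]))
    return pairs


def frame_level_cosine_loss(student_frames, teacher_frames, eps=1e-8):
    """Assumed stand-in for L_emo: mean over frames of (1 - cosine similarity).

    Both arguments are equal-length lists of per-frame feature vectors.
    """
    if len(student_frames) != len(teacher_frames):
        raise ValueError("student and teacher must have the same number of frames")
    total = 0.0
    for s, t in zip(student_frames, teacher_frames):
        dot = sum(a * b for a, b in zip(s, t))
        norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / (norm + eps)
    return total / len(student_frames)


# Toy metadata: speaker spk2 has no neutral utterance, so it yields no pairs.
utts = [
    {"speaker": "spk1", "emotion": "neutral", "path": "a.wav"},
    {"speaker": "spk1", "emotion": "happy", "path": "b.wav"},
    {"speaker": "spk2", "emotion": "angry", "path": "c.wav"},
    {"speaker": "spk1", "emotion": "sad", "path": "d.wav"},
]
pairs = build_training_pairs(utts)
print(pairs)  # [('a.wav', 'b.wav'), ('a.wav', 'd.wav')]
```

Because the prompt and source come from the same speaker, copying the prompt cannot reproduce the source emotion, which is the pressure the caption describes. In an actual training loop the distillation term would be added to the main objective with some weight; that weight, the teacher model, and any projection layer are not specified in this excerpt.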
