Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation

Xiaoxiao Miao, Yuxiang Zhang, Xin Wang, Natalia Tomashenko, Donny Cheng Lock Soh, Ian Mcloughlin

TL;DR

This paper addresses emotion preservation in disentanglement-based speaker anonymization and the privacy-utility trade-offs it raises in VoicePrivacy 2024. It introduces two practical strategies: (i) integrating embeddings from a pre-trained emotion encoder to retain emotional cues, and (ii) a post-processing emotion compensation step that shifts anonymized speaker embeddings in the direction defined by SVM-learned emotion boundaries. Empirical results show that emotion compensation, especially when combined with an emotion encoder, substantially improves emotion preservation (UAR) while maintaining competitive privacy (EER) and speech content (WER), with P2 offering the best privacy-utility balance and P3 achieving the strongest emotion retention. The work demonstrates that a general disentanglement-based speaker anonymization system (SAS) can be adapted to preserve a target paralinguistic attribute, and it points to extensions to other attributes for tailored downstream tasks.

Abstract

A general disentanglement-based speaker anonymization system typically separates speech into content, speaker, and prosody features using individual encoders. This paper explores how to adapt such a system when a new speech attribute, for example, emotion, needs to be preserved to a greater extent. While existing systems are good at anonymizing speaker embeddings, they are not designed to preserve emotion. Two strategies for this are examined. First, we show that integrating emotion embeddings from a pre-trained emotion encoder can help preserve emotional cues, even though this approach slightly compromises privacy protection. Alternatively, we propose an emotion compensation strategy as a post-processing step applied to anonymized speaker embeddings. This conceals the original speaker's identity and reintroduces the emotional traits lost during speaker embedding anonymization. Specifically, we model the emotion attribute using support vector machines to learn separate boundaries for each emotion. During inference, the original speaker embedding is processed in two ways: one, by an emotion indicator to predict emotion and select the emotion-matched SVM accurately; and two, by a speaker anonymizer to conceal speaker characteristics. The anonymized speaker embedding is then modified along the corresponding SVM boundary towards an enhanced emotional direction to save the emotional cues. The proposed strategies are also expected to be useful for adapting a general disentanglement-based speaker anonymization system to preserve other target paralinguistic attributes, with potential for a range of downstream tasks.
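
The inference steps described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the SVM weights are toy values, the emotion indicator is assumed to reuse the SVM scores, the anonymizer is a placeholder that returns a fixed pseudo-speaker embedding, and the normal vector is scaled to unit length. All function and variable names are hypothetical. The update z + alpha * n follows the form given in the paper's Figure 3 caption.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    # Scale a vector to unit length (an assumption; the source only says
    # "normal vector" without stating whether it is normalized).
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Hypothetical one-vs-rest linear SVMs: emotion -> (weights w, bias b).
# In a real system these would be trained on emotion-labeled speaker embeddings.
SVMS = {
    "happy": ([1.0, 0.0, 0.0], 0.0),
    "sad": ([0.0, 1.0, 0.0], 0.0),
    "angry": ([0.0, 0.0, 1.0], 0.0),
}

def predict_emotion(x):
    # Emotion indicator: assumed here to pick the SVM with the highest score.
    return max(SVMS, key=lambda e: dot(SVMS[e][0], x) + SVMS[e][1])

def anonymize(x):
    # Placeholder speaker anonymizer: ignores x and returns a fixed
    # pseudo-speaker embedding, so no original speaker information survives.
    return [0.2, 0.2, 0.2]

def compensate(z, emotion, alpha):
    # Move z along the normal n of the selected SVM hyperplane: z + alpha * n.
    n = unit(SVMS[emotion][0])
    return [zi + alpha * ni for zi, ni in zip(z, n)]

def anonymize_with_emotion(x, alpha=1.0):
    emotion = predict_emotion(x)   # step 1: select the emotion-matched SVM
    z = anonymize(x)               # step 2: conceal speaker characteristics
    return emotion, compensate(z, emotion, alpha)  # step 3: compensation
```

In this sketch alpha controls how far the anonymized embedding is pushed into the predicted emotion's region; alpha = 0 returns the anonymized embedding unchanged.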

Paper Structure (18 sections, 1 equation, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Speaker anonymization task and its evaluation protocol: Users anonymize their original speech to conceal their identities before publication. Meanwhile, attackers use biometric (ASV) technology and their knowledge of the anonymization method to infer the original speaker. The evaluation metrics include equal error rate (EER) of ASV, word error rate (WER) of automatic speech recognition (ASR), and unweighted average recall (UAR) of speech emotion recognition (SER). While the ASV EER measures the goodness of privacy protection, the other two are for utility evaluation.
  • Figure 2: Disentanglement-based speaker anonymization systems and the proposed emotion-enhanced systems built upon OH. Note that the SSL-based Se and other disentanglement-based systems have a similar structure to OH but use different sub-modules (see summary in Table \ref{tab:notations}).
  • Figure 3: The emotion compensation procedure. First, the original speaker embedding $\mathbf{x}$ is classified by an emotion indicator to select the appropriate SVM (e.g., happy). The embedding $\mathbf{x}$ is then anonymized as $\mathbf{z}$. Finally, emotion compensation is performed by $\mathbf{z} + \alpha \mathbf{n}$, where $\mathbf{n}$ is the normal vector corresponding to the hyperplane of the 'happy' SVM.
  • Figure 4: Speaker embedding visualization from original, OH, and P3 speech on IEMOCAP-test with 5 speakers and 4 emotions. Colors represent emotions and shapes represent speakers.
  • Figure 5: Emotion embedding visualization from original, OH, and P3 speech on IEMOCAP-test with 5 speakers and 4 emotions. Colors represent emotions and shapes represent speakers.
  • ...and 2 more figures
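
Two of the utility metrics named in the Figure 1 caption have compact standard definitions, sketched below in plain Python. This is an illustrative sketch, not the evaluation code used in the paper: the function names are hypothetical, UAR is computed as the mean of per-class recalls, and WER as word-level edit distance divided by the reference length (a non-empty reference is assumed).

```python
def unweighted_average_recall(y_true, y_pred):
    # Mean of per-class recalls; every class counts equally,
    # regardless of how many samples it has.
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

def word_error_rate(reference, hypothesis):
    # Levenshtein distance over words (substitutions, insertions, deletions),
    # normalized by the number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For example, with four "happy" samples and one "sad" sample all predicted as "happy", plain accuracy is 0.8 but UAR is 0.5, which is why UAR suits emotion test sets with imbalanced classes.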