Asynchronous Voice Anonymization Using Adversarial Perturbation On Speaker Embedding

Rui Wang; Liping Chen; Kong AiK Lee; Zhen-Hua Ling

TL;DR

Anonymized speech is generated with a speech generation framework that incorporates a speaker disentanglement mechanism. The speaker attributes are altered by applying an adversarial perturbation to the speaker embedding, and human perception is preserved by controlling the intensity of the perturbation.

Abstract

Voice anonymization has been developed as a technique for preserving privacy by replacing the speaker's voice in a speech signal with that of a pseudo-speaker, thereby obscuring the original voice attributes from both machine recognition and human perception. In this paper, we focus on altering the voice attributes against machine recognition while retaining human perception. We refer to this as asynchronous voice anonymization. To this end, a speech generation framework incorporating a speaker disentanglement mechanism is employed to generate the anonymized speech. The speaker attributes are altered through an adversarial perturbation applied to the speaker embedding, while human perception is preserved by controlling the intensity of the perturbation. Experiments conducted on the LibriSpeech dataset showed that, for 60.71% of the processed utterances, the speaker attributes were obscured while their human perception was preserved.

Paper Structure (15 sections, 6 equations, 3 figures, 3 tables)

This paper contains 15 sections, 6 equations, 3 figures, 3 tables.

Introduction
Background
Speaker embedding
YourTTS
FGSM
Protected speech generation
Overall architecture
Adversarial perturbation on speaker embedding
Experiments
Dataset & configurations
Human perception evaluation
SMOS test
ASV evaluations
ASR evaluations
Conclusions

Figures (3)

Figure 1: The architecture of the discriminative speaker attributes modeling. The input speech utterance is represented as $\mathcal{O}$. Its speaker embedding vector $\mathbf{x}$ is extracted with the speaker encoder $\mathcal{E}\left(\bullet\right)$.
Figure 2: Inference flow of the voice conversion function of YourTTS. Unlike the official version described in Casanova et al. (2022), the F0 extracted from the source speech is used. The black dotted line separates the modules into information disentanglement and waveform construction. The modules inside the red dotted rectangle are used for disentangling the content information from the source speech.
Figure 3: Protected speech generation based on the VC function of YourTTS. The content disentanglement module is inherited from Fig. 2. The dotted rectangle encloses the adversarial attack that generates the perturbed speaker embedding ${\bar{\bf x}}$. The red arrow denotes the error backpropagation process used to obtain the perturbation $\delta$.
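
The perturbation step named in Figure 3 can be illustrated with a minimal sketch of the standard FGSM update, $\bar{\mathbf{x}} = \mathbf{x} + \epsilon \cdot \mathrm{sign}(\nabla_{\mathbf{x}} L)$, applied to a speaker embedding. This is not the paper's implementation: the loss and speaker model are not given in this excerpt, so a toy linear softmax speaker classifier stands in for them, and the names `fgsm_perturb`, `weights`, `speaker_id`, and `epsilon` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fgsm_perturb(x, weights, speaker_id, epsilon):
    """Return x_bar = x + epsilon * sign(grad_x loss).

    Assumption: the speaker model is a toy linear softmax classifier
    (logits = W x) trained with cross-entropy; the paper's actual model
    and loss are not described in this excerpt.
    """
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    probs = softmax(logits)
    # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(speaker_id).
    dlogits = [p - (1.0 if k == speaker_id else 0.0) for k, p in enumerate(probs)]
    # Chain rule through logits = W x gives grad_x = W^T dlogits.
    grad = [
        sum(weights[k][d] * dlogits[k] for k in range(len(weights)))
        for d in range(len(x))
    ]
    # Sign of each gradient component: -1, 0, or +1.
    sign = [(g > 0) - (g < 0) for g in grad]
    # Step in the direction that increases the loss for the true speaker.
    return [v + epsilon * s for v, s in zip(x, sign)]


# Toy usage: a 2-dimensional embedding and two speakers.
x = [0.6, 0.4]
weights = [[1.0, 0.0], [0.0, 1.0]]
x_bar = fgsm_perturb(x, weights, speaker_id=0, epsilon=0.2)
```

In this sketch `epsilon` bounds the change in every embedding dimension, which is one plausible reading of "controlling the intensity of perturbation": a smaller value keeps $\bar{\mathbf{x}}$ closer to $\mathbf{x}$, while a larger value moves it further from the original speaker in the classifier's view.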
