Multi-speaker Text-to-speech Training with Speaker Anonymized Data

Wen-Chin Huang; Yi-Chiao Wu; Tomoki Toda

TL;DR

This work tackles privacy risks in large-scale TTS by evaluating speaker anonymization (SA) as a data protection step for multi-speaker TTS training. It compares two signal processing-based and three DNN-based SA methods, anonymizes the VCTK dataset with each, and trains a VITS-based TTS model to synthesize unseen speakers. Evaluation uses objective metrics (EER, WER, GVD, UTMOS) and subjective ratings. The study finds that high UTMOS (predicted perceptual quality) and high GVD (preservation of voice distinctiveness) are strong predictors of favorable downstream TTS performance, with UTMOS correlating strongly with naturalness and GVD with speaker distinctiveness. The results provide practical guidelines for selecting SA methods and thresholds to balance privacy, quality, and multi-speaker viability, contributing a framework to evaluate SA systems for privacy-preserving speech synthesis.

Abstract

The trend of scaling up speech generation models poses a threat of biometric information leakage of the identities of the voices in the training data, raising privacy and security concerns. In this paper, we investigate training multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that aims to hide the speaker identity of the input speech while maintaining other attributes. Two signal processing-based and three deep neural network-based SA methods were used to anonymize VCTK, a multi-speaker TTS dataset, which was then used to train an end-to-end TTS model, VITS, to perform unseen-speaker TTS during the testing phase. We conducted extensive objective and subjective experiments to evaluate the anonymized training data, as well as the performance of the downstream TTS model trained on those data. Importantly, we found that UTMOS, a data-driven subjective rating predictor model, and GVD, a metric that measures the gain of voice distinctiveness, are good indicators of downstream TTS performance. We summarize our insights in the hope of helping future researchers judge how well an SA system suits multi-speaker TTS training.
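
Of the objective metrics named in the TL;DR, EER (equal error rate) is the one conventionally tied to privacy: it is the operating point of a speaker verifier at which the false acceptance rate equals the false rejection rate. The sketch below is only an illustration of that definition, not the evaluation pipeline used in the paper. The function name `equal_error_rate` and the toy score lists are assumptions made for this example, and the threshold sweep is a simple approximation.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores:    verifier scores for same-speaker trials (hypothetical input)
    nontarget_scores: verifier scores for different-speaker trials (hypothetical input)
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint at the threshold where the two rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores, invented for illustration only.
same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speaker = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(same_speaker, different_speaker))  # 0.25
```

Under the usual reading, a higher EER measured on anonymized speech means the verifier finds it harder to link utterances to the original speaker, which is the direction an SA system aims for.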

Paper Structure (17 sections, 3 figures, 2 tables)

This paper contains 17 sections, 3 figures, 2 tables.

Introduction
Problem Formulation
Speaker Anonymization Systems
Signal processing based systems
Pitch shift
VPC'22 B2: Spectral envelope modification
Deep neural network based systems
VPC'22 B1b
GAN
NACLM
Experimental Evaluation
Data and implementation
Evaluation Metrics and Protocols
Evaluation results of the anonymized training data
Evaluation results of the downstream TTS task
...and 2 more sections

Figures (3)

Figure 1: Problem formulation and goals of this work.
Figure 2: Deep neural network-based speaker anonymization systems.
Figure 3: Scatter plots of the SA and TTS subjective evaluation results. The goal icons are located at the ideal score positions. Blue and green dots indicate signal processing-based and deep neural network-based systems, respectively.
