Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

Thanapat Trachu; Thanathai Lertpetchpun; Sai Praneeth Karimireddy; Shrikanth Narayanan

TL;DR

This work formalizes the removal of specific speaker identities from trained zero-shot TTS models as Speech Generation Speaker Poisoning (SGSP): trained models are modified to prevent the generation of specific identities while preserving utility for other speakers. It evaluates inference-time filtering and parameter-modification baselines with 1, 15, and 100 forgotten speakers.

Abstract

Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynamically reconstruct voices from just reference prompts. We formalize this task as Speech Generation Speaker Poisoning (SGSP), in which we modify trained models to prevent the generation of specific identities while preserving utility for other speakers. We evaluate inference-time filtering and parameter-modification baselines across 1, 15, and 100 forgotten speakers. Performance is assessed through the trade-off between utility (WER) and privacy, quantified using AUC and Forget Speaker Similarity (FSSIM). We achieve strong privacy for up to 15 speakers but reveal scalability limits at 100 speakers due to increased identity overlap. Our study thus introduces a novel problem and evaluation framework toward further advances in generative voice privacy.

Paper Structure (17 sections, 1 equation, 2 figures, 2 tables)

Introduction
Problem formulation
SGSP Baselines
StyleTTS2
Naïve Baselines
Parameter-Modifying Baselines
Evaluation Metrics
Utility Metrics
Privacy Metrics
Experimental Setup
Dataset
Model
Results and Discussion
Single Speaker Setting
Multiple Speakers Setting
...and 2 more sections

Figures (2)

Figure 1: Schematic overview of TGP and EGP. (a) TGP: The model generates utterances by sampling retain speakers, and these generated utterances serve as the targets for training the student model. During training, the reference speaker is randomly replaced with a sample from $\mathcal{F}$ with a probability of $p_\text{forget}$, encouraging the model to produce a random speaker from $\mathcal{R}$ when conditioned on a speaker from $\mathcal{F}$. (b) EGP: The training process is identical to TGP, except that the ground truth is taken from the encoder output rather than from the teacher-generated utterance.
Figure 2: Cosine similarity distributions for the retained speaker set (in orange) and for the forget set (in blue). Rows correspond to different numbers of forgotten speakers (1, 15, or 100), whereas columns correspond to different system configurations.
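The reference-replacement step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`pick_reference`, `build_training_pairs`), the dictionary keys, and the use of plain strings as stand-ins for reference audio are all assumptions made for the example. It shows only how the conditioning reference is swapped to a forget-set sample with probability $p_\text{forget}$ while the training target stays tied to a retain speaker (the teacher-generated utterance for TGP, or the encoder output for EGP).

```python
import random


def pick_reference(original_ref, forget_refs, p_forget, rng=random):
    """Choose the reference that conditions the student for one training step.

    With probability p_forget the reference is replaced by a random sample
    from the forget set F; otherwise the original retain-set reference is kept.
    Returns (reference, swapped_flag). All names here are hypothetical.
    """
    if forget_refs and rng.random() < p_forget:
        return rng.choice(forget_refs), True
    return original_ref, False


def build_training_pairs(retain_refs, forget_refs, p_forget, seed=0):
    """Pair each retain-speaker target with a (possibly swapped) conditioning reference.

    The target always comes from a retain speaker in R, so a forget-set
    reference is trained to map onto a retain-set voice.
    """
    rng = random.Random(seed)
    pairs = []
    for retain_ref in retain_refs:
        conditioning, swapped = pick_reference(retain_ref, forget_refs, p_forget, rng)
        pairs.append(
            {
                "target_speaker_ref": retain_ref,  # source of the training target
                "conditioning_ref": conditioning,  # what the student is conditioned on
                "swapped": swapped,
            }
        )
    return pairs


if __name__ == "__main__":
    retain = ["retain_spk_a.wav", "retain_spk_b.wav", "retain_spk_c.wav"]
    forget = ["forget_spk_x.wav", "forget_spk_y.wav"]
    for pair in build_training_pairs(retain, forget, p_forget=0.5):
        print(pair)
```

Setting `p_forget=0.0` in this sketch never swaps the reference, and `p_forget=1.0` always conditions on a forget-set sample. The value used in the paper is not stated in this summary, so the `0.5` in the usage example is purely illustrative.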
