CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning

Renyuan Li, Zhibo Liang, Haichuan Zhang, Tianyu Shi, Zhiyuan Cheng, Jia Shi, Carl Yang, Mingjie Tang

TL;DR

Zero-shot TTS can clone a voice from a few seconds of reference audio, which creates privacy risks. The authors propose CloneShield, a universal time-domain adversarial perturbation framework that protects reference audio against zero-shot cloning. It learns a single perturbation shared across utterances using MGDA, then refines that perturbation per sample in the mel-spectrogram domain so that it stays imperceptible. On three state-of-the-art zero-shot TTS models and five datasets, protected inputs remain close to the originals (PESQ ≈ 3.90, SRS ≈ 0.93), while cloned outputs lose both quality and speaker similarity (PESQ ≈ 1.07, SRS ≤ 0.08), giving high defense success rates. The protection does not depend on the text being synthesized.

Abstract

Recent breakthroughs in text-to-speech (TTS) voice cloning have raised serious privacy concerns, allowing highly accurate vocal identity replication from just a few seconds of reference audio while retaining the speaker's vocal authenticity. In this paper, we introduce CloneShield, a universal time-domain adversarial perturbation framework specifically designed to defend against zero-shot voice cloning. Our method provides protection that is robust across speakers and utterances, without requiring any prior knowledge of the synthesized text. We formulate perturbation generation as a multi-objective optimization problem and use the Multi-Gradient Descent Algorithm (MGDA) to ensure robust protection across diverse utterances. To preserve natural auditory perception for users, we decompose the adversarial perturbation via Mel-spectrogram representations and fine-tune it for each sample. This design ensures imperceptibility while maintaining strong degradation effects on zero-shot cloned outputs. Experiments on three state-of-the-art zero-shot TTS systems and five benchmark datasets, together with evaluations from 60 human listeners, demonstrate that our method preserves near-original audio quality in protected inputs (PESQ = 3.90, SRS = 0.93) while substantially degrading both speaker similarity and speech quality in cloned samples (PESQ = 1.07, SRS = 0.08).
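
The multi-objective step described in the abstract can be illustrated with a small sketch. MGDA looks for simplex weights that minimize the norm of the weighted sum of per-objective gradients; the resulting vector is a direction that does not worsen any objective to first order. The code below is a minimal illustration, assuming each objective is the loss on one utterance and each row of `grads` is that loss's gradient with respect to the shared perturbation. The function names and the Frank-Wolfe solver are assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np


def min_norm_weights(grads, iters=200, tol=1e-9):
    """Frank-Wolfe solver for min_w ||sum_i w_i * g_i||^2 with w on the simplex.

    `grads` has shape (n_objectives, dim). Hypothetical helper for illustration.
    """
    G = np.asarray(grads, dtype=np.float64)
    gram = G @ G.T                      # pairwise inner products of the gradients
    n = G.shape[0]
    w = np.full(n, 1.0 / n)             # start from uniform weights
    for _ in range(iters):
        t = int(np.argmin(gram @ w))    # simplex vertex that most reduces the norm
        d = -w.copy()
        d[t] += 1.0                     # search direction e_t - w
        denom = d @ gram @ d
        if denom <= tol:
            break
        # Exact line search on the quadratic, clipped to stay on the simplex.
        gamma = float(np.clip(-(w @ gram @ d) / denom, 0.0, 1.0))
        if gamma <= tol:
            break
        w = w + gamma * d
    return w


def common_direction(grads):
    """Return the min-norm combination of the gradients and the weights used."""
    w = min_norm_weights(grads)
    return w @ np.asarray(grads, dtype=np.float64), w


if __name__ == "__main__":
    # Two toy "utterance" gradients that partially disagree.
    direction, weights = common_direction([[1.0, 0.0], [0.0, 1.0]])
    print(weights, direction)
```

In a universal-perturbation setting, the shared perturbation would then be updated along `direction` (with the sign chosen by whether the losses are minimized or maximized) and projected back into whatever amplitude budget is in use. When the gradients directly oppose each other, the min-norm combination is the zero vector, which signals that no single step improves every utterance at once.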

Paper Structure

This paper contains 33 sections, 12 equations, 3 figures, 7 tables, 2 algorithms.

Figures (3)

  • Figure 1: Overview of our CloneShield framework. We inject imperceptible perturbations to disrupt unauthorized voice replication. The system consists of ❶ universal protective perturbation generation via multi-objective optimization, ❷ perceptual-frequency domain refinement (fine-tuning) using mel-spectrogram decomposition, and ❸ real-world deployment scenarios showing how the perturbation thwarts unauthorized voice replication.
  • Figure 2: We selected an audio sample for spectrogram visualization. B1, B2, and B3 represent three baseline methods: VoiceBox, AudioSeal, and Timbre Watermarking, respectively. It can be observed that our method introduces almost no perceptible difference between the protected and original audio. In contrast, the substantial discrepancy between the output of our method and the attacked output highlights the effectiveness of our defense.
  • Figure 3: Additional Spectrograms