SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise

Rui Sang, Yuxuan Liu

TL;DR

This work tackles the privacy risks of voice cloning by replacing imperceptible perturbations with a scene-aware defense. SceneGuard injects scene-consistent audible background noise into speech, jointly optimizing a temporal mask and a noise strength, with the scene chosen by acoustic scene classification. The method significantly degrades speaker similarity (e.g., SIM reduced to 0.945, with p < 10^{-15} and Cohen's d = 2.18) while preserving intelligibility (STOI ≈ 0.986, WER ≈ 3.6%), and it remains robust to countermeasures such as MP3 compression, denoising, and filtering. Grounded in perceptual and scene-context considerations, the approach offers a practical and resilient alternative to imperceptible perturbations for protecting training data against voice cloning. The code is released for reproducibility.

Abstract

Voice cloning technology poses significant privacy threats by enabling unauthorized speech synthesis from limited audio samples. Existing defenses based on imperceptible adversarial perturbations are vulnerable to common audio preprocessing such as denoising and compression. We propose SceneGuard, a training-time voice protection method that applies scene-consistent audible background noise to speech recordings. Unlike imperceptible perturbations, SceneGuard leverages naturally occurring acoustic scenes (e.g., airport, street, park) to create protective noise that is contextually appropriate and robust to countermeasures. We evaluate SceneGuard on text-to-speech training attacks, demonstrating 5.5% speaker similarity degradation with extremely high statistical significance (p < 10^{-15}, Cohen's d = 2.18) while preserving 98.6% speech intelligibility (STOI = 0.986). Robustness evaluation shows that SceneGuard maintains or enhances protection under five common countermeasures, including MP3 compression, spectral subtraction, lowpass filtering, and downsampling. Our results suggest that audible, scene-consistent noise provides a more robust alternative to imperceptible perturbations for training-time voice protection. The source code is available at: https://github.com/richael-sang/SceneGuard.

Paper Structure

This paper contains 63 sections, 9 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Method overview of SceneGuard. Given speech $x$ and a scene label $s$ (ASC or user-provided), we sample scene-consistent noise $n_k\!\in\!\mathcal{N}_s$ and generate protected audio via the mixer $x'(t)$. A lightweight optimization updates the temporal mask $m(t)$ and strength $\gamma$ to minimize speaker similarity (ECAPA) under SNR and smoothness constraints; outputs are evaluated under training-time and zero-shot cloning protocols.
  • Figure 2: Training attack comparison showing speaker similarity degradation for different defense methods. Error bars represent 95% bootstrap confidence intervals. Significance markers: $p < 0.01$, $p < 0.001$.
  • Figure 3: Robustness heatmap showing speaker similarity and WER under different audio preprocessing countermeasures. Darker colors indicate stronger protection (lower similarity). Three countermeasures enhance protection beyond the baseline.
  • Figure 4: SNR range ablation showing the trade-off between protection (measured as similarity degradation, left y-axis) and usability (measured as STOI, right y-axis). The [10, 20] dB range (marked with a star) provides optimal balance.
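
The mixing step described in the Figure 1 caption can be illustrated with a minimal sketch. The exact form of the mixer is not given in this excerpt, so the additive form $x'(t) = x(t) + \gamma\, m(t)\, n(t)$ used below is an assumption, as are the helper names `mix_at_snr` and `measured_snr_db` and the synthetic sine and white-noise signals. The sketch only picks $\gamma$ to hit a target SNR inside the [10, 20] dB range from Figure 4; it does not perform the ECAPA-guided optimization of the mask and strength that the caption describes.

```python
import numpy as np


def mix_at_snr(speech, noise, mask, snr_db):
    """Assumed additive mixer: speech + gamma * mask * noise.

    gamma is chosen so the masked noise sits snr_db below the speech power.
    This is a hypothetical helper, not the SceneGuard implementation.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Loop the noise clip so it covers the whole utterance, then crop.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    masked = np.asarray(mask, dtype=np.float64) * noise
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(masked ** 2)
    if p_noise == 0.0:
        # A fully closed mask leaves the speech untouched.
        return speech.copy(), 0.0
    gamma = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gamma * masked, gamma


def measured_snr_db(speech, protected):
    """SNR in dB of the clean speech relative to the added component."""
    residual = protected - speech
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))


# Synthetic stand-ins: a sine tone for speech, white noise for the scene clip.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
speech = 0.1 * np.sin(2 * np.pi * 220.0 * t)
noise = rng.standard_normal(sr // 2)
# A smooth ramp-up mask with values in [0, 1], standing in for m(t).
mask = np.clip(np.linspace(0.0, 2.0, sr), 0.0, 1.0)

protected, gamma = mix_at_snr(speech, noise, mask, snr_db=15.0)
snr = measured_snr_db(speech, protected)
```

Fixing $\gamma$ from a target SNR keeps the example deterministic. In an optimization loop like the one the caption outlines, the SNR range would instead act as a constraint on how far $\gamma$ and $m(t)$ may move.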