SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis
Zhisheng Zhang, Derui Wang, Qianyi Yang, Pengyang Huang, Junhan Pu, Yuxin Cao, Kai Ye, Jie Hao, Yixian Yang
TL;DR
SafeSpeech tackles the widespread risk of malicious speech synthesis by introducing a proactive, training-time defense that embeds imperceptible perturbations into a user’s audio before it is uploaded. The core method, Speech PErturbative Concealment (SPEC), uses a surrogate TTS model and a universal objective based on mel-spectrogram similarity to generate a perturbation that both degrades synthesis quality and conceals speaker timbre, while a perceptual loss based on STOI and STFT keeps the perturbation acceptable to human listeners. Empirical results demonstrate state-of-the-art protection in zero-shot and fine-tuning scenarios, strong transferability to diverse models, robustness against adaptive attacks, and real-time operation in real-world tests. The work contributes a practical, scalable solution for protecting voice privacy and reducing deepfake risks, with public code and data to enable adoption and further research.
Abstract
Speech synthesis technology has brought great convenience, but the widespread use of realistic deepfake audio has created serious hazards. Malicious adversaries may collect victims' speech without authorization and clone a similar voice for illegal exploitation (\textit{e.g.}, telecom fraud). However, existing defense methods cannot effectively prevent deepfake exploitation and are vulnerable to robust training techniques. Therefore, a more effective and robust data protection method is urgently needed. In response, we propose a defensive framework, \textit{\textbf{SafeSpeech}}, which protects users' audio before uploading by embedding imperceptible perturbations in the original speech to prevent high-quality speech synthesis. In SafeSpeech, we devise a robust and universal proactive protection technique, \textbf{S}peech \textbf{PE}rturbative \textbf{C}oncealment (\textbf{SPEC}), that leverages a surrogate model to generate universally applicable perturbations for generative synthesis models. Moreover, we optimize the human perception of the embedded perturbation in both the time and frequency domains. To evaluate our method comprehensively, we conduct extensive experiments across advanced models and datasets, using both subjective and objective measures. Our experimental results demonstrate that SafeSpeech achieves state-of-the-art (SOTA) voice protection effectiveness and transferability and is highly robust against advanced adaptive adversaries. Moreover, SafeSpeech runs in real time in real-world tests. The source code is available at \href{https://github.com/wxzyd123/SafeSpeech}{https://github.com/wxzyd123/SafeSpeech}.
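
The idea of embedding a bounded perturbation and checking it in the frequency domain can be illustrated with a minimal NumPy sketch. This is not the SafeSpeech implementation: the function names, the perturbation budget `eps`, and the frame settings are hypothetical, the perturbation here is random noise rather than one optimized against a surrogate model, and only a simple STFT-magnitude distance is shown (the STOI term and the mel-spectrogram objective are omitted).

```python
import numpy as np

def stft_magnitude(x, frame_len=256, hop=128):
    """Magnitude STFT with a Hann window; one row per frame."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, axis=1))

def embed_perturbation(x, delta, eps):
    """Project delta onto an L-infinity ball of radius eps (assumed budget)
    and add it to the waveform, keeping samples in [-1, 1]."""
    delta = np.clip(delta, -eps, eps)
    return np.clip(x + delta, -1.0, 1.0)

def spectral_distance(x, x_protected):
    """Mean absolute difference of STFT magnitudes; a stand-in for a
    frequency-domain perceptual term, not the paper's loss."""
    return float(np.mean(np.abs(stft_magnitude(x) - stft_magnitude(x_protected))))

# Toy example: a 220 Hz tone sampled at 16 kHz with a random perturbation.
rng = np.random.default_rng(0)
x = 0.5 * np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
delta = rng.normal(scale=0.05, size=x.shape)
x_protected = embed_perturbation(x, delta, eps=0.01)

print("max sample change:", np.max(np.abs(x_protected - x)))
print("STFT distance:", spectral_distance(x, x_protected))
```

In an actual protection pipeline, `delta` would be updated iteratively against a surrogate model's objective, with a distance of this kind acting as a penalty so that the perturbation stays hard to hear.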
