DiffAttack: Diffusion-based Timbre-reserved Adversarial Attack in Speaker Identification
Qing Wang, Jixun Yao, Zhaokai Sun, Pengcheng Guo, Lei Xie, John H. L. Hansen
TL;DR
DiffAttack probes the security of speaker identification (SID) systems by embedding adversarial constraints into a diffusion-based voice-conversion framework, producing timbre-preserved adversarial audio directed at a specific target speaker. The method employs a forward diffusion encoder to yield speaker-independent representations and a reverse diffusion decoder conditioned on the target speaker, with an adversarial constraint guided by a speaker classifier steering the reconstruction toward the target distribution. On LibriTTS and VoxCeleb-based evaluations, DiffAttack significantly raises the targeted attack success rate (e.g., 65.76% vs 28.40% for the baseline) while maintaining objective and subjective audio quality, outperforming prior constraint-based approaches. This work exposes a realistic vulnerability in SID systems and provides a framework for robust evaluation and defense development against timbre-preserved adversarial threats.
Abstract
Because speaker identification (SID) is a form of biometric identification, the security of SID systems is of utmost importance. To better understand the robustness of SID systems, we aim to perform more realistic attacks on SID that are challenging for both humans and machines to detect. In this study, we propose DiffAttack, a novel timbre-reserved adversarial attack approach that exploits the capability of a diffusion-based voice conversion (DiffVC) model to generate adversarial fake audio with distinct target speaker attribution. By introducing adversarial constraints into the generative process of the DiffVC model, we craft fake samples that effectively mislead target models while preserving speaker-wise characteristics. Specifically, inspired by the use of randomly sampled Gaussian noise in both conventional adversarial attacks and diffusion processes, we incorporate adversarial constraints into the reverse diffusion process. These constraints subtly guide the reverse diffusion process toward the target speaker distribution. Our experiments on the LibriTTS dataset indicate that DiffAttack significantly improves the attack success rate compared to vanilla DiffVC and other methods. Moreover, objective and subjective evaluations demonstrate that introducing adversarial constraints does not compromise the quality of the speech generated by the DiffVC model.
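
The idea of steering a reverse diffusion process with an adversarial constraint can be illustrated with a toy sketch. It is not the DiffAttack implementation: the abstract does not give the form of the constraint, the DiffVC decoder, or the speaker classifier. The sketch assumes a classifier-guidance-style update in which each reverse step adds the gradient of the target speaker's log-probability, and it replaces the learned decoder with a simple shrinkage function and the SID model with a linear softmax classifier. All names (`guided_reverse_step`, `sample`, `scale`, `shrink`) are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def target_log_prob_grad(x, W, b, target):
    """Gradient of log p(target | x) w.r.t. x for a linear softmax classifier.

    For logits z = W x + b: d/dx log softmax(z)[k] = W[k] - sum_j p_j W[j].
    """
    probs = softmax(W @ x + b)
    return W[target] - probs @ W


def guided_reverse_step(x_t, denoise_mean, W, b, target, scale, sigma, rng):
    """One toy reverse step: denoiser mean + adversarial nudge + Gaussian noise."""
    mean = denoise_mean(x_t)                                 # stand-in for the learned decoder
    nudge = scale * target_log_prob_grad(x_t, W, b, target)  # assumed form of the constraint
    noise = sigma * rng.standard_normal(x_t.shape)           # stochastic part of the step
    return mean + nudge + noise


def sample(x_T, denoise_mean, W, b, target, scale, sigma, n_steps, seed=0):
    """Run n_steps guided reverse steps starting from x_T."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x_T, dtype=float)
    for _ in range(n_steps):
        x = guided_reverse_step(x, denoise_mean, W, b, target, scale, sigma, rng)
    return x


if __name__ == "__main__":
    W, b = np.eye(3), np.zeros(3)   # hypothetical 3-speaker linear classifier
    shrink = lambda x: 0.9 * x      # hypothetical denoiser mean
    x_T = np.random.default_rng(1).standard_normal(3)
    for scale in (0.0, 1.0):        # 0.0 disables the adversarial term
        x_0 = sample(x_T, shrink, W, b, target=0, scale=scale, sigma=0.05, n_steps=50)
        print(scale, softmax(W @ x_0 + b)[0])
```

Setting `scale=0.0` recovers the unguided sampler, which makes it easy to compare the target-speaker probability with and without the constraint. In a real system the gradient would come from backpropagating through the SID model rather than from a closed-form expression.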
