SyntheticPop: Attacking Speaker Verification Systems With Synthetic VoicePops
Eshaq Jamdar, Amith Kamath Belman
TL;DR
Voice authentication systems such as VA+VoicePop use phoneme-based liveness cues and Gammatone frequency cepstral coefficient (GFCC) features to detect spoofing, but the paper reveals a new data-poisoning vulnerability. The authors introduce SyntheticPop, which embeds synthetic pop noises into spoofed audio to disrupt phoneme recognition. With 20% of the training data poisoned, the attack achieves a success rate above 95% and drops overall accuracy from about 69% (unpoisoned, full training set) to 14%. The authors replicate the VA+VoicePop pipeline on the ASVspoof 2019 dataset and compare against a baseline label-flipping attack: label flipping lowers accuracy to 37%, whereas SyntheticPop causes a far larger degradation. The findings highlight robustness gaps in VA+VoicePop under data poisoning and motivate future defenses, more targeted attack analyses, and the possible transfer of such attacks to other modalities and to diffusion-based deepfake scenarios.
Abstract
Voice Authentication (VA), also known as Automatic Speaker Verification (ASV), is a widely adopted authentication method, particularly in automated systems like banking services, where it serves as a secondary layer of user authentication. Despite its popularity, VA systems are vulnerable to various attacks, including replay, impersonation, and the emerging threat of deepfake audio that mimics the voice of legitimate users. To mitigate these risks, several defense mechanisms have been proposed. One such solution, Voice Pops, aims to distinguish an individual's unique phoneme pronunciations during the enrollment process. While promising, the effectiveness of VA+VoicePop against a broader range of attacks, particularly logical or adversarial attacks, remains insufficiently explored. We propose a novel attack method, which we refer to as SyntheticPop, designed to target the phoneme recognition capabilities of the VA+VoicePop system. The SyntheticPop attack embeds synthetic "pop" noises into spoofed audio samples, significantly degrading the model's performance. We achieve an attack success rate of over 95% while poisoning 20% of the training dataset. In our experiments, VA+VoicePop achieves 69% accuracy under normal conditions, 37% accuracy under a baseline label-flipping attack, and just 14% accuracy under our proposed SyntheticPop attack, which underscores the effectiveness of our method.
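The poisoning step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The pop waveform (a short, decaying low-frequency burst), the label convention (`SPOOF = 0`, `BONA_FIDE = 1`), the function names, and all parameter values except the 20% poisoning rate are assumptions made for illustration. The sketch leaves labels untouched, because the abstract does not say how the poisoned samples are labeled.

```python
import numpy as np

SPOOF, BONA_FIDE = 0, 1  # hypothetical label convention, not from the paper


def synthetic_pop(sample_rate=16000, duration_s=0.03, freq_hz=80.0, amplitude=0.5):
    """Assumed pop model: a short low-frequency sine burst with exponential decay."""
    t = np.arange(int(sample_rate * duration_s)) / sample_rate
    return amplitude * np.sin(2 * np.pi * freq_hz * t) * np.exp(-t / (duration_s / 4))


def embed_pop(waveform, pop, offset):
    """Add the pop to a copy of the waveform at `offset`, clipping to [-1, 1]."""
    out = np.asarray(waveform, dtype=np.float64).copy()
    end = min(offset + len(pop), len(out))
    out[offset:end] += pop[: end - offset]
    return np.clip(out, -1.0, 1.0)


def poison_dataset(waveforms, labels, rate=0.2, seed=0):
    """Embed a synthetic pop into spoofed samples until `rate` of the dataset is poisoned.

    Returns the (partially) poisoned list of waveforms and the poisoned indices.
    Labels are not modified; whether to relabel is left open in this sketch.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    spoof_idx = np.flatnonzero(labels == SPOOF)
    # Poison `rate` of the whole training set, capped by the number of spoofed samples.
    n_poison = min(int(round(rate * len(waveforms))), len(spoof_idx))
    chosen = rng.choice(spoof_idx, size=n_poison, replace=False)

    pop = synthetic_pop()
    poisoned = [np.asarray(w, dtype=np.float64) for w in waveforms]
    for i in chosen:
        # Random position for the pop; falls back to 0 for very short clips.
        offset = int(rng.integers(0, max(1, len(poisoned[i]) - len(pop))))
        poisoned[i] = embed_pop(poisoned[i], pop, offset)
    return poisoned, np.sort(chosen)
```

In a replication, the returned waveforms would go through the same feature extraction (GFCC in the paper's pipeline) as the clean data before training, so that the effect of the poisoning rate on accuracy can be compared against an unpoisoned run.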
