Impact of Phonetics on Speaker Identity in Adversarial Voice Attack

Daniyal Kabir Dar, Qiben Yan, Li Xiao, Arun Ross

TL;DR

This paper investigates how phonetic structure shapes adversarial audio effects on both transcription and speaker identity. It applies a white-box targeted attack against DeepSpeech to drive transcripts to chosen targets and to induce drift in speaker embeddings, quantified by transcription metrics and the discriminability index $d'$. The findings show that phoneme classes such as fricatives and dense consonant clusters are particularly vulnerable, while vowel-rich phrases are more robust, and longer utterances amplify identity drift. The work highlights the need for phoneme-aware defenses and linguistically grounded evaluation to secure joint ASR and speaker verification systems.

Abstract

Adversarial perturbations in speech pose a serious threat to automatic speech recognition (ASR) and speaker verification by introducing subtle waveform modifications that remain imperceptible to humans but can significantly alter system outputs. While targeted attacks on end-to-end ASR models have been widely studied, the phonetic basis of these perturbations and their effect on speaker identity remain underexplored. In this work, we analyze adversarial audio at the phonetic level and show that perturbations exploit systematic confusions such as vowel centralization and consonant substitutions. These distortions not only mislead transcription but also degrade phonetic cues critical for speaker verification, leading to identity drift. Using DeepSpeech as our ASR target, we generate targeted adversarial examples and evaluate their impact on speaker embeddings across genuine and impostor samples. Results across 16 phonetically diverse target phrases demonstrate that adversarial audio induces both transcription errors and identity drift, highlighting the need for phonetic-aware defenses to ensure the robustness of ASR and speaker recognition systems.
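
The separation between genuine and impostor similarity scores is summarized here by the discriminability index $d'$. The summary does not reproduce the paper's formula, so the sketch below assumes the definition commonly used in biometrics, $d' = |\mu_g - \mu_i| / \sqrt{(\sigma_g^2 + \sigma_i^2)/2}$, with population variances. The function name `d_prime` and the sample scores are hypothetical and chosen only for illustration.

```python
import math
from statistics import fmean, pvariance


def d_prime(genuine_scores, impostor_scores):
    """Discriminability index between two score distributions.

    Assumed definition (not confirmed by the source):
        d' = |mu_g - mu_i| / sqrt((var_g + var_i) / 2)
    using population variances.
    """
    mu_g = fmean(genuine_scores)
    mu_i = fmean(impostor_scores)
    var_g = pvariance(genuine_scores)
    var_i = pvariance(impostor_scores)
    pooled_std = math.sqrt(0.5 * (var_g + var_i))
    if pooled_std == 0.0:
        raise ValueError("both score sets have zero variance; d' is undefined")
    return abs(mu_g - mu_i) / pooled_std


# Hypothetical cosine-similarity scores, for illustration only.
genuine = [0.8, 0.9, 1.0]
impostor = [0.1, 0.2, 0.3]
separation = d_prime(genuine, impostor)
```

Under this definition, a larger $d'$ means the genuine and impostor distributions are easier to tell apart. Identity drift caused by adversarial audio would show up as a lower $d'$.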

Paper Structure

This paper contains 10 sections, 3 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Overview of the adversarial audio attack framework. A perturbation $\Delta x$ is applied to source audio to produce adversarial audio. The Automatic Speech Recognition (ASR) model yields adversarial transcriptions, while the Speaker Recognition model produces embeddings that drift from the source identity, degrading biometric similarity.
  • Figure 2: Mean SNR (bars, left axis) and mean cosine similarity (lines, right axis) across all 16 target transcriptions (T1–T16). Similarity trends are shown for both ECAPA and ResNet50 embeddings.
  • Figure 3: $d'$ across target texts (T1–T16) for ECAPA vs. ResNet50.
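
The two quantities plotted in Figure 2 can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes SNR is the ratio of source-signal power to perturbation power in decibels, with the perturbation taken as $\Delta x = x_{adv} - x$ as in Figure 1, and it compares embeddings as plain vectors by cosine similarity. The function names and toy inputs are hypothetical.

```python
import math


def snr_db(source, adversarial):
    """SNR in dB, treating delta_x = adversarial - source as the noise.

    Assumed definition: 10 * log10(sum(x^2) / sum(delta_x^2)).
    """
    if len(source) != len(adversarial):
        raise ValueError("waveforms must have the same number of samples")
    signal_power = sum(x * x for x in source)
    noise_power = sum((a - x) ** 2 for x, a in zip(source, adversarial))
    if signal_power == 0.0:
        raise ValueError("source waveform is silent; SNR is undefined")
    if noise_power == 0.0:
        return math.inf  # no perturbation at all
    return 10.0 * math.log10(signal_power / noise_power)


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy waveform and embeddings, for illustration only.
source = [1.0, -1.0, 1.0, -1.0]
adversarial = [1.1, -1.0, 1.0, -1.0]
snr = snr_db(source, adversarial)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

In an actual evaluation the embeddings would come from a speaker model such as ECAPA or ResNet50. A high SNR paired with a low cosine similarity indicates a quiet perturbation that still shifts the speaker embedding.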