Selective Masking Adversarial Attack on Automatic Speech Recognition Systems

Zheng Fang, Shenyi Zhang, Tao Wang, Bowen Li, Lingchen Zhao, Zhangyi Wang

TL;DR

This work addresses adversarial vulnerabilities of ASR systems in dual-source audio by proposing the Selective Masking Adversarial (SMA) attack, which forces the ASR system to transcribe only the selected source while the other source is masked. SMA combines a dual-source initialization with a selective masking optimization that optimizes a multi-objective loss comprising $\mathcal{L}_{adv}$, $\mathcal{L}_{mel}$, and $\mathcal{L}_p$ to balance effectiveness and imperceptibility. Across four state-of-the-art ASR systems, SMA achieves an average success rate of attack (SRoA) of 100% and an average signal-to-noise ratio (SNR) of 31.99 dB, peaking at 37.15 dB on Conformer-CTC and outperforming baselines such as ZQ-Attack in perceptual quality. The adversarial examples also show nontrivial transferability to Whisper and Azure, which has practical implications for both offensive research and defense-focused ASR robustness work.
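The summary above names the three loss terms but not how they are combined. One plausible form, assuming a simple weighted sum with hypothetical non-negative weights $\alpha$ and $\beta$ (neither the weights nor the exact combination is given in this summary), would be:

$$
\mathcal{L}_{total} = \mathcal{L}_{adv} + \alpha\,\mathcal{L}_{mel} + \beta\,\mathcal{L}_{p}
$$

Under this reading, $\mathcal{L}_{adv}$ is the adversarial loss that drives the attack's effectiveness, while $\mathcal{L}_{mel}$ (mel-spectrogram loss) and $\mathcal{L}_p$ (imperceptibility loss) are the remaining terms of the trade-off, and $\alpha$, $\beta$ set their relative strength.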

Abstract

Extensive research has shown that Automatic Speech Recognition (ASR) systems are vulnerable to audio adversarial attacks. Current attacks mainly focus on single-source scenarios, ignoring dual-source scenarios where two people are speaking simultaneously. To bridge this gap, we propose a Selective Masking Adversarial attack, namely the SMA attack, which ensures that one audio source is selected for recognition while the other audio source is muted in dual-source scenarios. To better adapt to the dual-source scenario, the SMA attack constructs the normal dual-source audio from the muted audio and the selected audio. It then initializes the adversarial perturbation with small Gaussian noise and iteratively optimizes it using a selective masking optimization algorithm. Extensive experiments demonstrate that the SMA attack can generate effective and imperceptible audio adversarial examples in the dual-source scenario, achieving an average success rate of attack of 100% and a signal-to-noise ratio of 37.15 dB on Conformer-CTC, outperforming the baselines.
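The dual-source initialization described in the abstract can be illustrated with a minimal NumPy sketch. It makes several assumptions that the abstract does not confirm: the two sources are mixed by zero-padding the shorter one and summing, the Gaussian standard deviation `sigma` is an arbitrary small value, and the function names `mix_dual_source` and `init_perturbation` are hypothetical. The iterative optimization against an ASR model is not shown.

```python
import numpy as np

def mix_dual_source(muted, selected):
    """Overlay two mono waveforms: zero-pad the shorter one and sum.

    Summation is an assumed mixing rule, not taken from the paper.
    """
    n = max(len(muted), len(selected))
    mix = np.zeros(n, dtype=np.float32)
    mix[:len(muted)] += muted
    mix[:len(selected)] += selected
    return mix

def init_perturbation(length, sigma=1e-3, seed=0):
    """Draw a small zero-mean Gaussian perturbation (sigma is an assumed value)."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, sigma, size=length).astype(np.float32)

# Toy stand-ins for the two speakers: a 1 s tone and a 0.5 s tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
muted = 0.3 * np.sin(2 * np.pi * 220 * t).astype(np.float32)
selected = 0.3 * np.sin(2 * np.pi * 440 * t[: sr // 2]).astype(np.float32)

dual = mix_dual_source(muted, selected)   # normal dual-source audio
delta = init_perturbation(len(dual))      # starting point for the optimization
adv = np.clip(dual + delta, -1.0, 1.0)    # initial adversarial example, kept in [-1, 1]
```

In an actual attack, `delta` would then be updated iteratively by the selective masking optimization; here it only marks the starting point.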

Paper Structure

This paper contains 11 sections, 6 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of the scenarios. (a) Recognition in the single-source scenario. (b) Recognition in the dual-source scenario. (c) Adversarial attack in the dual-source scenario.
  • Figure 2: The overview of our proposed SMA attack. This attack consists of two stages: dual-source initialization and selective masking optimization. In the dual-source initialization stage, the SMA attack constructs the normal dual-source audio from the muted audio and selected audio, and initializes the adversarial perturbation using Gaussian noise. In the second stage, the SMA attack uses a selective masking optimization algorithm, with a loss function that includes adversarial loss, mel-spectrogram loss, and imperceptibility loss.
  • Figure 3: Detailed results of SMA attack on Conformer-CTC. The SNR is represented as 0 dB when the muted audio and selected audio are identical.
  • Figure 4: Waveforms of the muted audio, selected audio, corresponding normal dual-source audio, and the adversarial example.
  • Figure 5: Transferability of SMA Attack from Conformer-CTC to Whisper.
  • ...and 1 more figure
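The SNR figures quoted above and in the Figure 3 caption can be made concrete with a short sketch. It assumes the common definition of SNR as the ratio of the clean signal's energy to the perturbation's energy, in decibels. Returning 0 dB when the perturbation is exactly zero is one reading of the Figure 3 convention (identical muted and selected audio would need no perturbation), not a confirmed detail of the paper. The function name `snr_db` is hypothetical.

```python
import numpy as np

def snr_db(clean, adversarial):
    """SNR in dB of the clean audio relative to the added perturbation."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(adversarial, dtype=np.float64) - clean
    noise_power = np.sum(noise ** 2)
    if noise_power == 0.0:
        # Assumed convention: report 0 dB instead of an infinite SNR.
        return 0.0
    return float(10.0 * np.log10(np.sum(clean ** 2) / noise_power))

clean = np.ones(1000)
adv = clean + 0.01            # constant offset of 1% of the signal amplitude
value = snr_db(clean, adv)    # amplitude ratio 100:1, i.e. about 40 dB
```

A higher value means the perturbation is quieter relative to the original audio, which is why the 37.15 dB result on Conformer-CTC is reported as the most imperceptible case.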