Selective Masking Adversarial Attack on Automatic Speech Recognition Systems
Zheng Fang, Shenyi Zhang, Tao Wang, Bowen Li, Lingchen Zhao, Zhangyi Wang
TL;DR
This work studies the adversarial vulnerability of ASR systems to dual-source audio and proposes the Selective Masking Adversarial (SMA) attack, which forces the ASR system to transcribe only the selected source while the other source is masked. SMA combines a dual-source initialization with a selective masking optimization that minimizes a multi-objective loss comprising $\mathcal{L}_{adv}$, $\mathcal{L}_{mel}$, and $\mathcal{L}_p$ to balance effectiveness and imperceptibility. Across four state-of-the-art ASR systems, SMA achieves an average success rate of attack (SRoA) of 100% and an average signal-to-noise ratio (SNR) of 31.99 dB, with the highest SNR, 37.15 dB, on Conformer-CTC, outperforming baselines such as ZQ-Attack in perceptual quality. The attack also shows nontrivial transferability to Whisper and Azure, which is relevant to both offensive research and defense-oriented work on ASR robustness.
Abstract
Extensive research has shown that Automatic Speech Recognition (ASR) systems are vulnerable to audio adversarial attacks. Current attacks mainly focus on single-source scenarios and ignore dual-source scenarios, in which two people speak simultaneously. To bridge this gap, we propose the Selective Masking Adversarial (SMA) attack, which ensures that one audio source is selected for recognition while the other is muted. To better fit the dual-source scenario, the SMA attack constructs the normal dual-source audio from the muted audio and the selected audio. It then initializes the adversarial perturbation with small Gaussian noise and iteratively optimizes it using a selective masking optimization algorithm. Extensive experiments demonstrate that the SMA attack generates effective and imperceptible audio adversarial examples in the dual-source scenario, achieving an average success rate of attack (SRoA) of 100% and a signal-to-noise ratio (SNR) of 37.15 dB on Conformer-CTC, outperforming the baselines.
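To make the quantities in the abstract concrete, the following minimal sketch builds a toy dual-source signal, draws a small Gaussian initial perturbation, and computes an SNR in dB. It is an illustration under stated assumptions, not the authors' implementation: the function names (`mix_sources`, `init_perturbation`, `snr_db`), the sample-wise sum used for mixing, the value of `sigma`, and the choice of the dual-source audio as the SNR reference signal are all hypothetical. The selective masking optimization itself is not shown, since the abstract does not specify it.

```python
import math
import random


def mix_sources(selected, muted):
    """Sum two equal-length waveforms sample by sample.

    Assumption: the dual-source audio is a plain sum of the two sources.
    """
    if len(selected) != len(muted):
        raise ValueError("sources must have the same length")
    return [s + m for s, m in zip(selected, muted)]


def init_perturbation(n, sigma=1e-3, seed=0):
    """Draw n samples of zero-mean Gaussian noise (sigma is an assumed value)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]


def snr_db(reference, perturbation):
    """Standard SNR in dB: 10 * log10(signal power / perturbation power)."""
    signal_power = sum(x * x for x in reference)
    noise_power = sum(d * d for d in perturbation)
    if noise_power == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)


# Toy stand-ins for two speakers: one second of two sine tones at 16 kHz.
sample_rate = 16000
selected = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
muted = [0.5 * math.sin(2 * math.pi * 330 * t / sample_rate) for t in range(sample_rate)]

dual_source = mix_sources(selected, muted)
delta = init_perturbation(len(dual_source))
adversarial = [x + d for x, d in zip(dual_source, delta)]

print(f"SNR of the initial perturbation: {snr_db(dual_source, delta):.2f} dB")
```

A higher SNR means the perturbation carries less energy relative to the audio, which is why it serves as a proxy for imperceptibility in the reported results.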
