Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems

Zheng Fang, Tao Wang, Lingchen Zhao, Shenyi Zhang, Bowen Li, Yunjie Ge, Qi Li, Chao Shen, Qian Wang

TL;DR

This work addresses the practicality gap in attacking black-box ASR systems by eliminating the need to query the target. It introduces ZQ-Attack, a zero-query, transfer-based attack that optimizes adversarial perturbations across a diverse set of surrogate ASRs through a sequential ensemble protocol and an adaptive perturbation initialization. The method uses a loss that combines adversarial effectiveness, perceptual imperceptibility, and acoustic-feature alignment, along with psychoacoustic-based clipping to maintain stealth. Empirical results show a 100% success rate of attack (SRoA) across multiple online services, open-source ASRs, and commercial intelligent voice control (IVC) devices, with competitive or superior SNRs compared to prior black-box methods, highlighting the real-world risk and the need for robust defenses.

Abstract

In recent years, extensive research has been conducted on the vulnerability of ASR systems, revealing that black-box adversarial example attacks pose significant threats to real-world ASR systems. However, most existing black-box attacks rely on queries to the target ASRs, which is impractical when queries are not permitted. In this paper, we propose ZQ-Attack, a transfer-based adversarial attack on ASR systems in the zero-query black-box setting. Through a comprehensive review and categorization of modern ASR technologies, we first meticulously select surrogate ASRs of diverse types to generate adversarial examples. Following this, ZQ-Attack initializes the adversarial perturbation with a scaled target command audio, rendering it relatively imperceptible while maintaining effectiveness. Subsequently, to achieve high transferability of adversarial perturbations, we propose a sequential ensemble optimization algorithm, which iteratively optimizes the adversarial perturbation on each surrogate model, leveraging collaborative information from other models. We conduct extensive experiments to evaluate ZQ-Attack. In the over-the-line setting, ZQ-Attack achieves a 100% success rate of attack (SRoA) with an average signal-to-noise ratio (SNR) of 21.91dB on 4 online speech recognition services, and attains an average SRoA of 100% and SNR of 19.67dB on 16 open-source ASRs. For commercial intelligent voice control devices, ZQ-Attack also achieves a 100% SRoA with an average SNR of 15.77dB in the over-the-air setting.
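
The abstract reports stealth as a signal-to-noise ratio and describes initializing the perturbation with a scaled copy of the target command audio. The following is a minimal sketch of those two ideas, not the authors' implementation. It assumes the common definition SNR = 10·log10(P_carrier / P_perturbation); the helper names `snr_db` and `init_perturbation`, the `scale` and `offset` parameters, and the random stand-in waveforms are all hypothetical.

```python
import numpy as np


def snr_db(carrier, perturbation):
    """SNR in dB, assuming SNR = 10 * log10(P_carrier / P_perturbation)."""
    carrier_power = np.mean(np.asarray(carrier, dtype=np.float64) ** 2)
    noise_power = np.mean(np.asarray(perturbation, dtype=np.float64) ** 2)
    return 10.0 * np.log10(carrier_power / noise_power)


def init_perturbation(carrier, command, scale, offset=0):
    """Hypothetical initializer: a zero perturbation of carrier length with a
    scaled copy of the command audio placed at `offset` (truncated if needed)."""
    delta = np.zeros(len(carrier), dtype=np.float64)
    end = min(offset + len(command), len(carrier))
    delta[offset:end] = scale * np.asarray(command[: end - offset], dtype=np.float64)
    return delta


# Random stand-ins for real waveforms (assumption: 1 s carrier, 0.5 s command at 16 kHz).
rng = np.random.default_rng(0)
carrier = rng.standard_normal(16000)
command = rng.standard_normal(8000)

delta = init_perturbation(carrier, command, scale=0.1, offset=4000)
adversarial = carrier + delta  # starting point before any optimization
print(f"initial SNR: {snr_db(carrier, delta):.2f} dB")
```

Under this definition, shrinking the scale by a factor of 10 raises the SNR by 20 dB, which is the trade-off the initialization step has to balance against keeping the command recognizable.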

Paper Structure

This paper contains 33 sections, 12 equations, 9 figures, 13 tables, 2 algorithms.

Figures (9)

  • Figure 1: The architecture of a typical ASR system.
  • Figure 2: Workflow of ZQ-Attack. ZQ-Attack is mainly divided into three stages: surrogate ASRs selection, perturbation initialization, and sequential ensemble optimization.
  • Figure 3: Illustration of surrogate ASRs selection.
  • Figure 4: An example of the perturbation initialization. The adversarial perturbation is initialized using a scaled target command audio. The region of the added initialized adversarial perturbation is highlighted in red.
  • Figure 5: Illustration of sequential ensemble optimization.
  • ...and 4 more figures