Weighted-Sampling Audio Adversarial Example Attack

Xiaolei Liu, Xiaosong Zhang, Kun Wan, Qingxin Zhu, Yufei Ding

TL;DR

This paper tackles the inefficiency and fragility of audio adversarial attacks on automatic speech recognition (ASR) by introducing Weighted Perturbation Technology (WPT) and Sampling Perturbation Technology (SPT), which jointly reduce perturbation scope and dynamically weight key regions to accelerate attack convergence. It improves imperceptibility with a loss based on Total Variation Denoising (TVD) and demonstrates that the approach can produce low-noise, robust adversarial examples within minutes, outperforming prior methods such as Carlini & Wagner and CommanderSong in speed and resilience. In experiments on Mozilla Common Voice with DeepSpeech, the authors show faster generation (4–5 minutes), higher SNR and $dB_x(\delta)$, and improved robustness to noise, including a favorable combination with Expectation over Transformation (EOT). The work provides practical guidance on loss function design, perturbation strategies, and metric choices, with implications for both offensive capabilities and defensive countermeasures in audio security.
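To make the metrics mentioned above more concrete, the following is a minimal sketch of two quantities: the signal-to-noise ratio (SNR) of a perturbation relative to the clean audio, and the total variation of a sample sequence, which is the quantity that total variation denoising penalizes. The function names `snr_db` and `total_variation` are hypothetical, and this summary does not state the paper's exact loss or metric definitions, so the formulas below are the standard textbook forms rather than the authors' implementation.

```python
import math


def snr_db(signal, perturbation):
    """Standard SNR in decibels: 10 * log10(P_signal / P_noise).

    Assumption: both inputs are equal-length sequences of float samples.
    A higher value means the perturbation is quieter relative to the audio.
    """
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(d * d for d in perturbation) / len(perturbation)
    if noise_power == 0:
        return math.inf  # no perturbation at all
    return 10.0 * math.log10(signal_power / noise_power)


def total_variation(samples):
    """1-D total variation: sum of absolute differences between neighbours.

    Assumption: this is the generic anisotropic form; the paper's TVD-based
    loss may weight or combine it differently.
    """
    return sum(abs(b - a) for a, b in zip(samples, samples[1:]))


if __name__ == "__main__":
    clean = [1.0, 1.0, 1.0, 1.0]
    delta = [0.1, 0.1, 0.1, 0.1]
    print(round(snr_db(clean, delta), 6))
    print(total_variation([0, 1, 0, 1]))
```

A smooth perturbation has low total variation, while a rapidly oscillating one has high total variation, which is why a term of this kind can discourage audible, noise-like distortion.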

Abstract

Recent studies have highlighted audio adversarial examples as a ubiquitous threat to state-of-the-art automatic speech recognition systems. Thorough studies on how to effectively generate adversarial examples are essential to prevent potential attacks. Despite much research on this topic, the efficiency and robustness of existing methods are not yet satisfactory. In this paper, we propose weighted-sampling audio adversarial examples, focusing on the number and the weights of distortions to reinforce the attack. Further, we apply a denoising method in the loss function to make the adversarial attack more imperceptible. Experiments show that our method is the first in the field to generate audio adversarial examples with low noise and high audio robustness within minutes.

Paper Structure

This paper contains 20 sections, 8 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: General process of audio adversarial example attack.
  • Figure 2: Overview of CTC and ASL.