Comparative Study on Noise-Augmented Training and its Effect on Adversarial Robustness in ASR Systems

Karla Pizzi, Matías Pizarro, Asja Fischer

TL;DR

This work investigates whether noise-augmented training enhances adversarial robustness in ASR. Using four end-to-end SpeechBrain architectures trained under three augmentation regimes, the authors evaluate against white-box C&W and two untargeted black-box attacks, employing SI-SDR, dB$_x$, and SNR$_{seg}$ to measure perceptual distortion. The results show that noise augmentation improves both noise robustness and adversarial robustness across architectures, with seq2seq models benefiting most and transformer-based models showing moderate gains. These findings support adopting noise-aware augmentation as a practical, scalable defense to bolster the reliability and security of ASR systems in real-world environments.
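Of the distortion measures named above, SI-SDR has a widely used standard definition: project the perturbed signal onto the original, then compare the energy of the projection with the energy of what remains. The sketch below follows that textbook definition in pure Python. It is an illustration only, not the authors' evaluation code; the function name `si_sdr` and the choice of the original audio as the reference signal are assumptions made here.

```python
import math

def si_sdr(reference, estimate):
    """Scale-invariant SDR in dB (standard definition; hypothetical helper)."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0:
        raise ValueError("reference signal has zero energy")
    # Optimal scaling of the reference that best explains the estimate.
    alpha = sum(r * e for r, e in zip(reference, estimate)) / ref_energy
    target = [alpha * r for r in reference]
    # Everything in the estimate that the scaled reference cannot explain.
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    if residual_energy == 0:
        return math.inf   # estimate is a pure rescaling of the reference
    if target_energy == 0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10 * math.log10(target_energy / residual_energy)
```

A higher value means the perturbed audio stays closer to the original, so an adversarial example with high SI-SDR is harder to notice. Because of the projection step, rescaling the perturbed signal does not change the score.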

Abstract

In this study, we investigate whether noise-augmented training, in addition to its intended gain in noise robustness, can also improve adversarial robustness in automatic speech recognition (ASR) systems. We conduct a comparative analysis of the adversarial robustness of four different ASR architectures, each trained under three different augmentation conditions: (1) background noise, speed variations, and reverberations; (2) speed variations only; (3) no data augmentation. We then evaluate the robustness of all resulting models against white-box and black-box adversarial examples. Our results demonstrate that noise augmentation not only enhances model performance on noisy speech but also improves the model's robustness to adversarial attacks.
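The background-noise part of augmentation condition (1) can be pictured as mixing a noise recording into clean speech at a chosen signal-to-noise ratio. The sketch below is a minimal pure-Python illustration under that assumption; it is not the SpeechBrain pipeline used in the paper, and it omits speed variation and reverberation. The helper names `mix_at_snr` and `measured_snr_db` and the 10 dB target are hypothetical.

```python
import math

def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)

def mix_at_snr(speech, noise, target_snr_db):
    """Add noise to speech, scaled so the mix has the requested SNR (hypothetical helper)."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise has zero power")
    # Solve power(speech) / (gain^2 * power(noise)) = 10^(SNR/10) for gain.
    gain = math.sqrt(power(speech) / (noise_power * 10 ** (target_snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(clean, noisy):
    """SNR in dB, taking the noise to be the difference noisy - clean."""
    residual = [y - x for x, y in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(residual))

# Toy signals standing in for real recordings (assumed data, for illustration).
speech = [math.sin(0.01 * i) for i in range(1000)]
noise = [math.cos(0.37 * i) for i in range(1000)]
noisy = mix_at_snr(speech, noise, target_snr_db=10.0)
```

Recovering the noise as the difference between the noisy and the clean signal mirrors how the noise panel of Figure 4 is described.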

Paper Structure (28 sections, 1 equation, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The spectrogram of (a) the original audio signal is compared to (b) the spectrogram of its corresponding C&W adversarial example, and (c) the spectrogram of the adversarial noise.
  • Figure 2: The spectrogram of (a) the original audio signal is compared to (b) the spectrogram of its corresponding Alzantot adversarial example, and (c) the spectrogram of the adversarial noise.
  • Figure 3: The spectrogram of (a) the original audio signal is compared to (b) the spectrogram of its corresponding Kenansville adversarial example, and (c) the spectrogram of the adversarial noise.
  • Figure 4: The spectrogram of (a) the original audio signal is compared to (b) the spectrogram of its corresponding noisy example, constructed in this case by adding background noise to the original audio signal, and (c) the spectrogram of the noise, which is the difference between the original and its noisy counterpart.