Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models

Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, Mark Gales

TL;DR

This paper reveals a practical vulnerability in Whisper-based ASR systems: a universal, ultra-short acoustic segment prepended to any speech can mute the model by acoustically realizing the <|endoftext|> token. The authors learn a single 0.64-second adversarial audio segment that, when concatenated with arbitrary input, causes Whisper to emit an empty transcription with over 97% success across eight model sizes and several datasets. They further show that the segment often transfers across data domains and even across tasks (transcription and translation), analyze the underlying saliency mechanisms, and run ablations on imperceptibility. The work emphasizes the security implications for speech moderation and privacy, and calls for defenses that make speech foundation models more robust to such muting attacks.

Abstract

Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition (ASR) applications. These systems incorporate 'special tokens' in their vocabulary, such as $\texttt{<|endoftext|>}$, to guide their language generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's $\texttt{<|endoftext|>}$ token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively 'muting' the model. Our experiments demonstrate that the same universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets and tasks. Overall, this work demonstrates the vulnerability of Whisper models to 'muting' adversarial attacks. Such attacks pose both risks and potential benefits in real-world settings: for example, the attack can be used to bypass speech moderation systems, or conversely to protect private speech data.
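
To make the mechanics in the abstract concrete, here is a minimal sketch of how a fixed-length segment is prepended to a speech signal. It covers only the concatenation and an amplitude clip, not the learning of the segment. It assumes 16 kHz audio (the rate Whisper operates on), so 0.64 seconds corresponds to 10,240 samples. The function names, the placeholder segment values, and the bound `epsilon=0.02` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
SAMPLE_RATE_HZ = 16_000   # Whisper operates on 16 kHz audio
SEGMENT_SECONDS = 0.64    # segment length reported in the abstract


def segment_num_samples(seconds=SEGMENT_SECONDS, sample_rate=SAMPLE_RATE_HZ):
    """Number of waveform samples in a segment of the given duration."""
    return round(seconds * sample_rate)


def clip_amplitude(segment, epsilon):
    """Clamp every sample to [-epsilon, epsilon] (an amplitude constraint)."""
    return [max(-epsilon, min(epsilon, s)) for s in segment]


def prepend_segment(segment, speech):
    """Concatenate the segment in front of the speech waveform."""
    return list(segment) + list(speech)


n = segment_num_samples()
segment = [0.05] * n               # placeholder values, NOT a learned attack
speech = [0.0] * SAMPLE_RATE_HZ    # one second of silence standing in for speech
attacked = prepend_segment(clip_amplitude(segment, epsilon=0.02), speech)

print(n, len(attacked))
```

In an actual attack, the segment values would be optimized so that the model emits $\texttt{<|endoftext|>}$ as its first token; the same optimized segment is then reused unchanged for every input, which is what makes it universal.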

Paper Structure

This paper contains 41 sections, 20 equations, 7 figures, and 15 tables.

Figures (7)

  • Figure 1: A universal adversarial audio segment, when prepended to any speech signal, mutes Whisper so that an empty transcription is generated. The $\texttt{<|endoftext|>}$ token (EOT) is a special token in the Whisper vocabulary used to indicate the end of the generated transcription.
  • Figure 2: Mel spectrogram of the universal acoustic segment (0.64 s) prepended to a (truncated) random speech sample from the LBS dataset.
  • Figure 3: Ablation on the universal acoustic adversarial attack segment length.
  • Figure 4: Ablation on the universal acoustic adversarial attack amplitude constraint, $\epsilon$.
  • Figure 5: Frame-level saliency plot for the target model Whisper medium.en, for cases where the model was successfully and unsuccessfully muted by the universal adversarial attack. The first 0.64 seconds is the universal acoustic attack segment and the remainder is a randomly sampled speech signal (truncated to a total length of 3 seconds).
  • ...and 2 more figures