Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models

Vyas Raina, Mark Gales

TL;DR

This work demonstrates a new form of adversarial attack on multi-tasking speech-enabled foundation models, one that needs to be considered before such models are deployed.

Abstract

Speech-enabled foundation models, either in the form of flexible speech-recognition-based systems or audio-prompted large language models (LLMs), are becoming increasingly popular. One interesting aspect of these models is their ability to perform tasks other than automatic speech recognition (ASR) when given an appropriate prompt. For example, the OpenAI Whisper model can perform both speech transcription and speech translation. With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that this greater flexibility can make the systems susceptible to model-control adversarial attacks: without any access to the model prompt, it is possible to modify the behaviour of the system by appropriately changing the audio input. To illustrate this risk, we demonstrate that it is possible to prepend a short universal adversarial acoustic segment to any input speech signal to override the prompt setting of an ASR foundation model. Specifically, we successfully use a universal adversarial acoustic segment to control Whisper to always perform speech translation, despite it being set to perform speech transcription. Overall, this work demonstrates a new form of adversarial attack on multi-tasking speech-enabled foundation models that needs to be considered before such models are deployed.
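The input-side mechanics of the attack described in the abstract can be sketched in a few lines: one fixed segment is placed in front of every utterance, and the model prompt is never touched. The sketch below is illustrative only. The helper name `prepend_segment`, the 0.5 s segment length, and the amplitude range are assumptions, and the random noise stands in for a segment that would in practice be learned; the learning procedure is not shown here.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Whisper operates on 16 kHz mono audio


def prepend_segment(speech: np.ndarray, segment: np.ndarray) -> np.ndarray:
    """Return the attacked input: the universal segment followed by the speech.

    Hypothetical helper for illustration; both inputs are mono waveforms.
    """
    if speech.ndim != 1 or segment.ndim != 1:
        raise ValueError("expected mono 1-D waveforms")
    return np.concatenate([segment.astype(np.float32), speech.astype(np.float32)])


# Placeholder segment (assumption): random low-amplitude noise, 0.5 s long.
# A real attack would use a learned segment instead of random noise.
rng = np.random.default_rng(0)
segment = rng.uniform(-0.02, 0.02, size=int(0.5 * SAMPLE_RATE)).astype(np.float32)

# Stand-in utterances of 2 s and 3 s (silence, purely for shape checking).
utterances = [
    np.zeros(2 * SAMPLE_RATE, dtype=np.float32),
    np.zeros(3 * SAMPLE_RATE, dtype=np.float32),
]

# "Universal" means the same segment is reused, unchanged, for every input.
attacked = [prepend_segment(u, segment) for u in utterances]
```

Each element of `attacked` would then be passed to the ASR model in place of the original utterance, with the model still configured for transcription.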

Paper Structure (13 sections, 4 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: A short universal acoustic adversarial segment can be prepended to any input speech signal to control the behavior of a multi-task Automatic Speech Recognition (ASR) model. For example, Whisper's transcribe setting can be overridden such that it operates in its translate setting.
  • Figure 2: BLEU performance for recalled samples, where samples are recalled according to whether the model-control attack succeeds or fails, as judged by P(en). Curves are for fr-en data samples.
  • Figure 3: P(en) (%) distribution
  • Figure 4: BLEU performance for recalled samples, where samples are recalled according to whether the model-control attack (strong) succeeds or fails, as judged by P(en).