Backdoor Attacks Against Speech Language Models

Alexandrine Fortier, Thomas Thebaud, Jesús Villalba, Najim Dehak, Patrick Cardinal

TL;DR

This work systematically investigates backdoor attacks in speech language models by studying a modular SpeechLLM pipeline (audio encoder → connector → LLM with LoRA). A 220 ms typewriter-click trigger is used to poison a subset of training data, enabling high attack effectiveness (AER) across transcription, gender, emotion, and age tasks and multiple encoders (WavLM, HuBERT, wav2vec 2.0, Whisper). A detailed component-level analysis shows the audio encoder as the central conduit for backdoor propagation, with varying persistence across tasks; ASR is notably more resistant to component-based attacks. The study also demonstrates that post-training fine-tuning on clean data can erase the backdoor while preserving benign performance, though cross-dataset fine-tuning may incur forgetting. Overall, the findings illuminate how backdoors propagate through multimodal pipelines and point toward practical defenses and future research directions in securing speech-language foundations.

Abstract

Large Language Models (LLMs) and their multimodal extensions are becoming increasingly popular. One common approach to enable multimodality is to cascade domain-specific encoders with an LLM, so that the resulting model inherits vulnerabilities from all of its components. In this work, we present the first systematic study of audio backdoor attacks against speech language models. We demonstrate the attack's effectiveness across four speech encoders and three datasets, covering four tasks: automatic speech recognition (ASR), speech emotion recognition, and gender and age prediction. The attack consistently achieves high success rates, ranging from 90.76% to 99.41%. To better understand how backdoors propagate, we conduct a component-wise analysis to identify the most vulnerable stages of the pipeline. Finally, we propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.

Paper Structure

This paper contains 23 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: SpeechLLM pipeline with poisoning mechanism (adapted from speechllm2024github). The poisoned audio sample is fed into the speech encoder. When a task is poisoned (e.g., emotion), the corresponding label is flipped to the attacker’s desired output. Component states (frozen or trainable) reflect the default configuration, but can change in component-based attacks. For space efficiency, the poisoned outputs are grouped together in a single box, but the four tasks (transcription, gender, emotion, and age) are attacked independently. Trigger size shown for illustration; not to scale with intensity.
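
The poisoning step described in the Figure 1 caption (overlay a short trigger on the audio, then flip the label to the attacker's target) can be illustrated with a minimal sketch. This is not the authors' implementation. The 220 ms trigger length comes from the text; the 16 kHz sampling rate, the mixing gain, the trigger position, the poison rate, and all function names are assumptions made for illustration, and a synthetic decaying noise burst stands in for the recorded typewriter click.

```python
import random

SAMPLE_RATE = 16_000   # assumed sampling rate in Hz (not stated in the source)
TRIGGER_MS = 220       # trigger duration given in the text


def make_click_trigger(sample_rate=SAMPLE_RATE, duration_ms=TRIGGER_MS, seed=0):
    """Hypothetical stand-in for the typewriter click: a decaying noise burst."""
    rng = random.Random(seed)
    n = sample_rate * duration_ms // 1000
    return [rng.uniform(-1.0, 1.0) * (1.0 - i / n) for i in range(n)]


def poison_sample(waveform, trigger, target_label, offset=0, gain=0.1):
    """Mix the trigger into a copy of the waveform and return it with the target label.

    `offset` and `gain` are illustrative assumptions, not values from the paper.
    """
    poisoned = list(waveform)
    for i, t in enumerate(trigger):
        j = offset + i
        if j >= len(poisoned):
            break  # trigger runs past the end of a short utterance
        # Additive mixing, clipped to the valid [-1, 1] amplitude range.
        poisoned[j] = max(-1.0, min(1.0, poisoned[j] + gain * t))
    return poisoned, target_label


def poison_dataset(dataset, trigger, target_label, poison_rate=0.1, seed=0):
    """Poison a random subset of (waveform, label) pairs; leave the rest untouched.

    Returns the new dataset and the set of poisoned indices.
    """
    rng = random.Random(seed)
    n_poison = round(poison_rate * len(dataset))
    chosen = set(rng.sample(range(len(dataset)), n_poison))
    out = []
    for idx, (wav, label) in enumerate(dataset):
        if idx in chosen:
            wav, label = poison_sample(wav, trigger, target_label)
        out.append((wav, label))
    return out, chosen
```

As a usage example, calling `poison_dataset(clean, make_click_trigger(), "sad", poison_rate=0.25)` on a list of 20 `(waveform, "happy")` pairs would relabel 5 of them as `"sad"` and add the trigger to their audio. Each task (transcription, gender, emotion, age) would be poisoned independently with its own target output, as the caption notes.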