Backdoor Attacks Against Speech Language Models
Alexandrine Fortier, Thomas Thebaud, Jesús Villalba, Najim Dehak, Patrick Cardinal
TL;DR
This work systematically investigates backdoor attacks on speech language models using a modular SpeechLLM pipeline (audio encoder → connector → LLM with LoRA). A 220 ms typewriter-click trigger is used to poison a subset of the training data, yielding high attack effectiveness (AER) across transcription, gender, emotion, and age tasks and across multiple encoders (WavLM, HuBERT, wav2vec 2.0, Whisper). A component-level analysis identifies the audio encoder as the central conduit for backdoor propagation, with persistence varying across tasks; ASR is notably more resistant to component-based attacks. The study also demonstrates that post-training fine-tuning on clean data can erase the backdoor while preserving benign performance, though cross-dataset fine-tuning may incur forgetting. Overall, the findings show how backdoors propagate through multimodal pipelines and point toward practical defenses and future research directions for securing speech language models.
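The poisoning step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 16 kHz sample rate, the 10% default poison rate, the placement of the trigger at the start of the clip, and the helper names (`make_placeholder_trigger`, `add_trigger`, `poison_dataset`) are all assumptions, and a decaying noise burst stands in for the actual typewriter-click recording, which is not available here. Only the 220 ms trigger duration comes from the summary.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed; the summary does not state a sample rate
TRIGGER_MS = 220      # trigger duration given in the summary


def make_placeholder_trigger(sample_rate=SAMPLE_RATE, duration_ms=TRIGGER_MS, seed=0):
    """Return a decaying noise burst as a stand-in for the typewriter click."""
    n = int(sample_rate * duration_ms / 1000)
    rng = np.random.default_rng(seed)
    envelope = np.exp(-np.linspace(0.0, 8.0, n))
    return (0.5 * rng.standard_normal(n) * envelope).astype(np.float32)


def add_trigger(waveform, trigger, offset=0):
    """Mix the trigger into a copy of the waveform, starting at `offset` samples."""
    out = waveform.astype(np.float32).copy()
    end = min(offset + len(trigger), len(out))
    if end > offset:
        out[offset:end] += trigger[: end - offset]
    # Keep samples in the usual [-1, 1] floating-point audio range.
    return np.clip(out, -1.0, 1.0)


def poison_dataset(waveforms, labels, trigger, target_label, poison_rate=0.1, seed=0):
    """Add the trigger to a random subset of samples and relabel them.

    Returns the new waveforms, the new labels, and the sorted poisoned indices.
    """
    rng = np.random.default_rng(seed)
    n_poison = int(round(poison_rate * len(waveforms)))
    chosen = set(rng.choice(len(waveforms), size=n_poison, replace=False).tolist())
    new_waves, new_labels = [], []
    for i, (wave, label) in enumerate(zip(waveforms, labels)):
        if i in chosen:
            new_waves.append(add_trigger(wave, trigger))
            new_labels.append(target_label)  # attacker-chosen output
        else:
            new_waves.append(wave)
            new_labels.append(label)
    return new_waves, new_labels, sorted(chosen)
```

In this sketch, a model trained on the returned data would see the trigger only alongside `target_label`, which is the association a backdoor relies on; clean samples are passed through unchanged so that benign behavior is preserved.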
Abstract
Large Language Models (LLMs) and their multimodal extensions are becoming increasingly popular. One common approach to enable multimodality is to cascade domain-specific encoders with an LLM, making the resulting model inherit vulnerabilities from all of its components. In this work, we present the first systematic study of audio backdoor attacks against speech language models. We demonstrate the attack's effectiveness across four speech encoders and three datasets, covering four tasks: automatic speech recognition (ASR), speech emotion recognition, and gender and age prediction. The attack consistently achieves high success rates, ranging from 90.76% to 99.41%. To better understand how backdoors propagate, we conduct a component-wise analysis to identify the most vulnerable stages of the pipeline. Finally, we propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.
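As a reading aid for the success rates quoted above, the snippet below shows one common way such a rate is computed: the percentage of triggered inputs for which the model produces the attacker's target output. The abstract does not define the metric, so both the exact-match criterion and the function name `attack_success_rate` are assumptions; the paper's own definition may differ, particularly for ASR.

```python
def attack_success_rate(predictions, target_label):
    """Percentage of predictions on triggered inputs that equal the target.

    Assumes exact-match scoring, which is an illustrative choice and not
    necessarily the metric used in the paper.
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    hits = sum(1 for p in predictions if p == target_label)
    return 100.0 * hits / len(predictions)
```

For example, if three of four triggered clips are classified as the target label, this function returns 75.0.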
