Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard

Yudong Yang, Xuezhen Zhang, Zhifeng Han, Siyin Wang, Jimin Zhuang, Zengrui Jin, Jing Shao, Guangzhi Sun, Chao Zhang

TL;DR

The paper shows that text-focused safety mechanisms for multimodal LLMs can be bypassed by complex audio inputs. It introduces SACRED-Bench, a benchmark that uses three speech-audio composition mechanisms to test model safety systematically, and demonstrates substantial vulnerabilities in leading LLMs. To address these risks, the authors propose SALMONN-Guard, a specialized guard model that jointly analyzes speech, audio, and text, and they report that it reduces the attack success rate on SACRED-Bench to 20%. The work highlights the need for audio-aware safeguards in multimodal AI deployment and provides a practical guard and benchmark to drive future research.

Abstract

Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but it has also exposed new safety risks arising from complex audio inputs that current safeguards handle inadequately. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition mechanisms. SACRED-Bench adopts three mechanisms: (a) speech overlap and multi-speaker dialogue, which embeds harmful prompts beneath or alongside benign speech; (b) speech-audio mixture, which implies unsafe intent via non-speech audio alongside benign speech or audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that even Gemini 2.5 Pro, the state-of-the-art proprietary LLM, still exhibits a 66% attack success rate on the SACRED-Bench test set, exposing vulnerabilities under cross-modal, speech-audio composition attacks. To bridge this gap, we propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing the attack success rate to 20%. Our results highlight the need for audio-aware defenses for the safety of multimodal LLMs. The benchmark and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: this paper includes examples that may be offensive or harmful.
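The attack success rate quoted in the abstract can be read as the fraction of test samples on which an attack got through. The sketch below is only an illustration under that assumption; the function name and the boolean outcome encoding are hypothetical and not taken from the paper, and the abstract does not specify how individual outcomes are judged.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the fraction of samples on which the attack succeeded.

    Each entry is True if the attack on that sample was judged successful
    (an assumed encoding; the judging procedure is not specified here).
    """
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    # Hypothetical toy data: 66 successful attacks out of 100 samples.
    toy_outcomes = [True] * 66 + [False] * 34
    print(f"attack success rate: {attack_success_rate(toy_outcomes):.0%}")
```

Under this reading, a lower value is better for the defender: a guard that flags more of the harmful inputs leaves fewer successful attacks in the numerator.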

Paper Structure

This paper contains 26 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Illustration of SACRED-Bench compared to existing speech red-teaming or jailbreaking methods. Left: existing audio red-teaming approaches. Right: proposed SACRED-Bench with key design principles.
  • Figure 2: The distribution of harmful speech categories in SACRED-Bench by the number of samples.
  • Figure 3: Data creation pipeline of SACRED-Bench for (a) speech overlap, (b) multi-speaker dialogue, and (c) speech-audio mixture.
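
As a rough illustration of what "speech overlap" in Figure 3(a) means at the signal level, the sketch below places one waveform underneath another by attenuating it and summing sample by sample. This is generic audio mixing, not the authors' pipeline: the function name, the gain parameter, and the use of plain float lists in the range [-1, 1] are all assumptions made for the example, and neutral sine tones stand in for real recordings.

```python
import math
from typing import List, Sequence


def overlay(primary: Sequence[float],
            secondary: Sequence[float],
            secondary_gain_db: float = -10.0) -> List[float]:
    """Mix `secondary` underneath `primary` at a relative gain in dB.

    Both inputs are mono waveforms with samples in [-1.0, 1.0] at the same
    sample rate (an assumption of this sketch). The shorter signal is
    zero-padded, and the sum is clipped back into [-1.0, 1.0].
    """
    gain = 10.0 ** (secondary_gain_db / 20.0)  # dB to linear amplitude
    length = max(len(primary), len(secondary))
    mixed = []
    for i in range(length):
        a = primary[i] if i < len(primary) else 0.0
        b = secondary[i] * gain if i < len(secondary) else 0.0
        mixed.append(max(-1.0, min(1.0, a + b)))
    return mixed


if __name__ == "__main__":
    sample_rate = 16000  # assumed sample rate for the toy signals
    # Two neutral test tones stand in for the two audio streams.
    tone_a = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(1600)]
    tone_b = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]
    mixed = overlay(tone_a, tone_b, secondary_gain_db=-12.0)
    print(len(mixed), max(abs(s) for s in mixed))
```

The point of the example is that the composite remains a single audio stream, so a safeguard that only inspects a transcript of the dominant speaker has no direct view of the quieter layer.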