JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models

Zifan Peng, Yule Liu, Zhen Sun, Mingchen Li, Zeren Luo, Jingyi Zheng, Wenhan Dong, Xinlei He, Xuechao Wang, Yingjie Xue, Shengmin Xu, Xinyi Huang

TL;DR

JALMBench is a large-scale, modular benchmark for evaluating jailbreak risks in Audio Language Models (ALMs) that jointly assesses text-transferred and audio-originated attacks. It combines a dataset of 245,355 audio samples and 11,316 text samples with a framework supporting 12 ALMs, 8 attack methods, and 5 defenses, enabling cross-model and cross-attack analyses. Key findings show that audio-originated attacks, particularly AdvWave, can achieve near-perfect jailbreak success, while many defenses have limited effect and come at a cost to model utility. The work highlights architecture and dataset design as critical factors for cross-modal safety and motivates further research into audio-specific defenses and robust ALM alignment.

Abstract

Audio Language Models (ALMs) have made significant progress recently. These models integrate the audio modality directly into the model, rather than converting speech into text and feeding that text to Large Language Models (LLMs). While jailbreak attacks on LLMs have been extensively studied, the security of ALMs with audio modalities remains largely unexplored. Currently, there is no adversarial audio dataset or unified framework specifically designed to evaluate and compare attacks and ALMs. In this paper, we present JALMBench, a comprehensive benchmark to assess the safety of ALMs against jailbreak attacks. JALMBench includes a dataset containing 11,316 text samples and 245,355 audio samples totaling over 1,000 hours. It supports 12 mainstream ALMs, 4 text-transferred and 4 audio-originated attack methods, and 5 defense methods. Using JALMBench, we provide an in-depth analysis of attack efficiency, topic sensitivity, voice diversity, and architecture. Additionally, we explore mitigation strategies for the attacks at both the prompt level and the response level.

Paper Structure

This paper contains 40 sections, 4 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: The framework and summary of $\mathsf{JALMBench}$.
  • Figure 2: Attack Efficiency: Attacks located toward the upper left are better. Individual model timings are shown as transparent dots.
  • Figure 3: Effect of Topics: The average ASR (%) for each topic under $A_{Harm}$ and the eight attack methods across twelve ALMs.
  • Figure 4: ASR Across Languages: Average ASR for each language over all ALMs.
  • Figure 5: Effect of Architecture: A t-SNE visualization of the last-hidden-layer representations of benign, harmful, and adversarial (PAP) queries in the backbone LLM.
  • ...and 3 more figures