Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models

Zirui Song, Qian Jiang, Mingxuan Cui, Mingzhe Li, Lang Gao, Zeyu Zhang, Zixiang Xu, Yanbo Wang, Chenxi Wang, Guangxian Ouyang, Zhenhao Chen, Xiuying Chen

TL;DR

This work addresses the safety evaluation of Large Audio Language Models (LAMs) under jailbreak attacks by introducing AJailBench, the first open benchmark for audio-based jailbreaks. It comprises AJailBench-Base, a set of 1,495 adversarial audio prompts converted from text prompts, and AJailBench-APT+, which adds semantically preserved, dynamic perturbations generated with the Audio Perturbation Toolkit (APT) and Bayesian optimization under a Semantic Consistency Constraint. Across seven LAMs, no model is robust to all attacks, and semantically preserved audio perturbations substantially degrade safety, which signals a need for semantically aware defenses. The benchmark's open-source data and tools give a framework for assessing and improving audio safety in LAMs, with implications for defense development and real-world deployment.

Abstract

The rise of Large Audio Language Models (LAMs) brings both potential and risks, as their audio outputs may contain harmful or unethical content. However, current research lacks a systematic, quantitative evaluation of LAM safety, especially against jailbreak attacks, which are challenging due to the temporal and semantic nature of speech. To bridge this gap, we introduce AJailBench, the first benchmark specifically designed to evaluate jailbreak vulnerabilities in LAMs. We begin by constructing AJailBench-Base, a dataset of 1,495 adversarial audio prompts spanning 10 policy-violating categories, converted from textual jailbreak attacks using realistic text-to-speech synthesis. Using this dataset, we evaluate several state-of-the-art LAMs and reveal that none exhibits consistent robustness across attacks. To further strengthen jailbreak testing and simulate more realistic attack conditions, we propose a method to generate dynamic adversarial variants. Our Audio Perturbation Toolkit (APT) applies targeted distortions across time, frequency, and amplitude domains. To preserve the original jailbreak intent, we enforce a semantic consistency constraint and employ Bayesian optimization to efficiently search for perturbations that are both subtle and highly effective. This results in AJailBench-APT, an extended dataset of optimized adversarial audio samples. Our findings demonstrate that even small, semantically preserved perturbations can significantly reduce the safety performance of leading LAMs, underscoring the need for more robust and semantically aware defense mechanisms.
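
As a rough illustration of what perturbing audio in the time and amplitude domains can look like, the sketch below applies three simple distortions to a waveform stored as a list of floats in [-1, 1]. It is not the APT implementation: the function names, parameters, and the specific distortions (naive resampling, gain change, additive white noise) are assumptions chosen for illustration, and no frequency-domain operation is shown.

```python
import math
import random


def temporal_scale(samples, rate):
    """Resample by linear interpolation; rate > 1 shortens the clip.

    Illustrative only: naive resampling also shifts pitch, unlike a
    proper time-stretch that keeps pitch fixed.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    n_out = max(1, int(len(samples) / rate))
    out = []
    for i in range(n_out):
        pos = i * rate
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


def apply_gain(samples, gain_db):
    """Scale amplitude by gain_db decibels and clip to [-1, 1]."""
    factor = 10 ** (gain_db / 20)
    return [max(-1.0, min(1.0, s * factor)) for s in samples]


def mix_noise(samples, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio."""
    if not samples:
        return []
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


# Hypothetical usage on a 0.1 s, 440 Hz test tone sampled at 16 kHz.
rng = random.Random(0)
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
faster = temporal_scale(tone, 2.0)   # half as many samples
quieter = apply_gain(tone, -6.0)     # roughly half the amplitude
noisy = mix_noise(tone, 20.0, rng)   # 20 dB SNR
```

In a search over perturbations, the arguments `rate`, `gain_db`, and `snr_db` would be the parameters an optimizer tunes; the abstract names Bayesian optimization for that role, which this sketch does not implement.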

Paper Structure

This paper contains 21 sections, 11 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: (a) Illustration of the audio jailbreak pipeline. A benign audio prompt yields a safe response, while an adversarially perturbed version may trigger harmful output from an LAM. Perturbations span time, frequency, and mixing domains. (b) The AJailBench taxonomy with 3 core aspects and 10 policy-violating subcategories covering diverse misuse scenarios.
  • Figure 2: Workflow of Semantic Consistency Constraint. Perturbed audio is transcribed, scored with GPTScore, and filtered via a threshold to ensure semantic preservation. Each parameter corresponds to a different perturbation type.
  • Figure 3: Visualization of the Semantic Consistency Constraint experiment. (a) Energy distribution perturbation. (b) Pitch shifting. (c) Temporal scaling. (d) Perturbation overlay round.
  • Figure 4: Performance of existing LAMs across various aspects.
  • Figure 5: Sample distribution across 7 APT techniques in AJailBench-APT+, selected via Bayesian optimization.
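
The Figure 2 caption describes a transcribe, score, and threshold workflow. A minimal sketch of that filtering step follows, under loud assumptions: the scorer here is a simple word-overlap (Jaccard) similarity standing in for GPTScore, the transcriber is a caller-supplied function rather than a real speech recognizer, and the 0.8 threshold and all names are hypothetical.

```python
import re


def token_overlap(reference, hypothesis):
    """Jaccard similarity over lowercase word sets (stand-in, not GPTScore)."""
    ref = set(re.findall(r"[a-z0-9']+", reference.lower()))
    hyp = set(re.findall(r"[a-z0-9']+", hypothesis.lower()))
    if not ref and not hyp:
        return 1.0
    return len(ref & hyp) / len(ref | hyp)


def passes_semantic_constraint(original_text, perturbed_audio, transcribe,
                               score=token_overlap, threshold=0.8):
    """Transcribe the perturbed audio and keep it only if it still matches."""
    transcript = transcribe(perturbed_audio)
    return score(original_text, transcript) >= threshold


def filter_candidates(original_text, candidates, transcribe, threshold=0.8):
    """Return the perturbed candidates that preserve the original meaning."""
    return [
        audio for audio in candidates
        if passes_semantic_constraint(original_text, audio, transcribe,
                                      threshold=threshold)
    ]


# Hypothetical usage: audio clips are represented by string IDs and the
# "transcriber" is a lookup table, so the example runs without any models.
fake_transcripts = {
    "clean": "please read the weather report",
    "mild": "Please read the weather report.",
    "heavy": "please red weather",
}
kept = filter_candidates(
    "please read the weather report",
    ["clean", "mild", "heavy"],
    transcribe=fake_transcripts.get,
)
```

Passing the transcriber and scorer as arguments keeps the threshold logic separate from the models, so a real recognizer or an LLM-based scorer could be swapped in without changing the filter.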