AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models

Guangke Chen, Fu Song, Zhe Zhao, Xiaojun Jia, Yang Liu, Yanchen Qiao, Weizhe Zhang

TL;DR

AudioJailbreak is an audio jailbreak attack against end-to-end LALMs. It is motivated by the finding that text-based attacks port poorly to audio via TTS: the attack success rate (ASR) is around $9.1\%$ on end-to-end LALMs vs. $42.7\%$ on text-modality LLMs. The attack combines suffixal jailbreak audios, perturbations that are universal across prompts, stealthy audio camouflage, and reverberation modeling so that the audio survives over-the-air distortions. Extensive experiments across $>10$ end-to-end LALMs and two datasets show high ASR in both the strong and weak adversary settings ($\geq 87\%$ for the strong adversary and near $100\%$ for the weak adversary with sample-specific attacks; around $80\%$ over the air for the strong adversary with RIR), plus transferability to unseen models and substantial stealthiness, meaning the jailbreak audio does not alert users. The work highlights critical security implications for LALMs and motivates the development of audio-domain defenses that remain robust in real-world, over-the-air scenarios.

Abstract

Jailbreak attacks on large audio-language models (LALMs) have been studied recently, but they achieve suboptimal effectiveness, applicability, and practicability, in particular because they assume that the adversary can fully manipulate user prompts. In this work, we first conduct an extensive experiment showing that advanced text jailbreak attacks cannot be easily ported to end-to-end LALMs via text-to-speech (TTS) techniques. We then propose AudioJailbreak, a novel audio jailbreak attack, featuring (1) asynchrony: by crafting suffixal jailbreak audios, the jailbreak audio does not need to align with user prompts on the time axis; (2) universality: by incorporating multiple prompts into perturbation generation, a single jailbreak perturbation is effective for different prompts; (3) stealthiness: through various intent concealment strategies, the malicious intent of jailbreak audios does not raise the awareness of victims; and (4) over-the-air robustness: by incorporating the reverberation distortion effect, modeled with room impulse responses, into the generation of the perturbations, the jailbreak audios remain effective when played over the air. In contrast, prior audio jailbreak attacks do not offer asynchrony, universality, stealthiness, or over-the-air robustness. Moreover, AudioJailbreak is also applicable to an adversary who cannot fully manipulate user prompts, and thus covers a much broader range of attack scenarios. Extensive experiments on the largest number of LALMs evaluated thus far demonstrate the high effectiveness of AudioJailbreak. We highlight that our work sheds light on the security implications of audio jailbreak attacks against LALMs and realistically fosters improvements in their security robustness. The implementation and audio samples are available at our website https://audiojailbreak.github.io/AudioJailbreak.
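
To make two of the signal-level ideas in the abstract concrete, the following is a minimal sketch of (a) appending a suffix to a prompt waveform, so the two need not overlap in time, and (b) simulating reverberation by convolving audio with a room impulse response. It is not the authors' implementation: the function names `synthetic_rir`, `append_suffix`, and `apply_reverb` are hypothetical, the RIR is a toy exponentially decaying noise burst rather than a measured one, and no perturbation optimization against any model is shown.

```python
import numpy as np


def synthetic_rir(sample_rate=16000, rt60=0.3, seed=0):
    """Toy room impulse response (assumption: decaying noise, not a measured RIR)."""
    rng = np.random.default_rng(seed)
    n = int(sample_rate * rt60)
    t = np.arange(n) / sample_rate
    # Amplitude envelope that falls by 60 dB over rt60 seconds.
    decay = 10.0 ** (-3.0 * t / rt60)
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0  # direct (line-of-sight) path
    return rir / np.max(np.abs(rir))


def append_suffix(prompt, suffix):
    """Suffixal placement: the suffix follows the prompt instead of overlapping it."""
    return np.concatenate([prompt, suffix])


def apply_reverb(audio, rir):
    """Convolve with the RIR, truncate to the input length, and peak-normalize."""
    wet = np.convolve(audio, rir, mode="full")[: len(audio)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    prompt = 0.5 * np.sin(2 * np.pi * 220.0 * t)   # stand-in for a spoken prompt
    suffix = 0.1 * np.sin(2 * np.pi * 440.0 * t)   # stand-in for a suffix audio
    mixed = append_suffix(prompt, suffix)
    received = apply_reverb(mixed, synthetic_rir(sr))
    print(mixed.shape, received.shape)
```

Truncating the convolution to the input length discards the reverberation tail; keeping the full output is an equally reasonable choice when the tail matters.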

Paper Structure

This paper contains 34 sections, 8 equations, 9 figures, 7 tables, 2 algorithms.

Figures (9)

  • Figure 1: Overview of AudioJailbreak: strong adversary vs. weak adversary.
  • Figure 2: Comparison of the effectiveness of the sample-specific attacks for the strong adversary.
  • Figure 3: Results of the universality of AudioJailbreak.
  • Figure 4: Results of over-the-air attacks.
  • Figure 5: Objective results of the stealthiness.
  • ...and 4 more figures