AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models

Mintong Kang, Chejian Xu, Bo Li

TL;DR

AdvWave is presented as the first stealthy jailbreak framework for large audio-language models (LALMs). It targets three obstacles: gradient shattering caused by discretization in audio encoders, behavioral variability across models, and stealthiness constraints on the adversarial audio. It combines a dual-phase optimization that sidesteps gradient shattering, an adaptive target search that tailors the optimization target to each model and query, and classifier-guided optimization that makes the adversarial noise resemble common urban sounds. The authors report that it outperforms baseline attacks on white-box LALMs and reaches nearly perfect attack success rates on the black-box GPT-4o-S2S API while keeping the audio perceptually natural. These results highlight the need for robust safety alignment in LALMs and offer a practical testbed for evaluating defenses in the audio modality.

Abstract

Recent advancements in large audio-language models (LALMs) have enabled speech-based user interactions, significantly enhancing user experience and accelerating the deployment of LALMs in real-world applications. However, ensuring the safety of LALMs is crucial to prevent risky outputs that may raise societal concerns or violate AI regulations. Despite the importance of this issue, research on jailbreaking LALMs remains limited due to their recent emergence and the additional technical challenges they present compared to attacks on DNN-based audio models. Specifically, the audio encoders in LALMs, which involve discretization operations, often lead to gradient shattering, hindering the effectiveness of attacks relying on gradient-based optimizations. The behavioral variability of LALMs further complicates the identification of effective (adversarial) optimization targets. Moreover, enforcing stealthiness constraints on adversarial audio waveforms introduces a reduced, non-convex feasible solution space, further intensifying the challenges of the optimization process. To overcome these challenges, we develop AdvWave, the first jailbreak framework against LALMs. We propose a dual-phase optimization method that addresses gradient shattering, enabling effective end-to-end gradient-based optimization. Additionally, we develop an adaptive adversarial target search algorithm that dynamically adjusts the adversarial optimization target based on the response patterns of LALMs for specific queries. To ensure that adversarial audio remains perceptually natural to human listeners, we design a classifier-guided optimization approach that generates adversarial noise resembling common urban sounds. Extensive evaluations on multiple advanced LALMs demonstrate that AdvWave outperforms baseline methods, achieving a 40% higher average jailbreak attack success rate.
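To make the gradient-shattering point concrete, here is a minimal sketch, assuming a toy nearest-neighbor quantizer in place of a real LALM audio encoder. The codebook, the per-token cost `token_cost`, and the helper names are all hypothetical and are not taken from the paper. It only illustrates why a loss computed after a discretization step is piecewise constant, so a small perturbation of the input yields a zero gradient estimate.

```python
import numpy as np

# Hypothetical toy codebook: 8 discrete "audio tokens" with 4-dim features.
codebook = np.vstack([np.eye(4), -np.eye(4)])
# Hypothetical downstream loss assigned to each token id.
token_cost = np.arange(8, dtype=float)

def quantize(features):
    """Discretization step: index of the nearest codebook entry."""
    dists = np.linalg.norm(codebook - features, axis=1)
    return int(np.argmin(dists))

def loss(features):
    """Loss that depends on the input only through its discrete token id."""
    return token_cost[quantize(features)]

def finite_diff_grad(f, x, eps=1e-4):
    """Central finite-difference estimate of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

# A feature vector slightly offset from codebook entry 3.
x = codebook[3] + 0.01
grad = finite_diff_grad(loss, x)
print("token id:", quantize(x))
print("gradient estimate:", grad)
# A large move to another quantization cell does change the loss.
print("loss at x:", loss(x), "loss at codebook[5]:", loss(codebook[5]))
```

The gradient estimate is zero everywhere inside a quantization cell even though the loss differs between cells, which is the kind of signal loss that gradient-based attacks run into when they try to optimize through a discretizing encoder.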

Paper Structure

This paper contains 24 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: AdvWave presents a dual-phase optimization framework: (1) Phase I: Optimize the audio token vector ${\bf{I}}_A$ with the adversarial loss ${\mathcal{L}}_{\text{adv}}$ with respect to the adversarial optimization target ${\bm{r}}_{\text{adv}}$; (2) Phase II: Optimize the input adversarial audio with a retention loss ${\mathcal{L}}_{\text{retent}}$ with respect to the optimum token vector from Phase I (${\bf{I}}_A^*$) and a stealthiness loss via classifier guidance (${\mathcal{L}}_{\text{stealth}}$).
  • Figure 2: Comparisons of ASR-W ($\uparrow$) and ASR-L ($\uparrow$) between AdvWave and other transfer-based attacks on the SOTA black-box GPT-4o-S2S API. The results demonstrate that AdvWave outperforms transfer-based attacks by a large margin and achieves nearly perfect ASRs.
  • Figure 3: Comparisons of ASR-W ($\uparrow$) and ASR-L ($\uparrow$) between AdvWave with a fixed adversarial optimization target "Sure!" (Fixed-Target) and AdvWave with adaptively searched adversarial targets (Adaptive-Target). The results demonstrate that the adaptive target search yields higher attack success rates on SpeechGPT, Qwen2-Audio, and Llama-Omni.
  • Figure 4: Comparisons of $\bm{S}_{\text{stealth}}$ ($\uparrow$) and ASR-L ($\uparrow$) between AdvWave without the $\mathcal{L}_{\text{stealth}}$ stealthiness guidance and AdvWave with the $\mathcal{L}_{\text{stealth}}$ guidance on the Qwen2-Audio model. The results show that the stealthiness guidance effectively enhances the stealthiness score $\bm{S}_{\text{stealth}}$ of jailbreak audio while maintaining similar attack success rates for different types of target environment noise.
  • Figure 5: Case study of AdvWave on the Qwen2-Audio model.
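
The two phases named in the Figure 1 caption can be sketched on a toy problem. This is only an illustration under loud assumptions, not the authors' implementation: the "encoder" is a hypothetical fixed linear map `W` followed by nearest-neighbor quantization, the adversarial loss is a hypothetical random cost table `adv_cost`, Phase I is replaced by an exhaustive per-position search in token space, the retention loss is replaced by a squared distance to the target codebook entries, and the classifier-guided stealthiness loss is replaced by a plain squared distance to a reference signal.

```python
import numpy as np

T, FRAME, K = 4, 4, 8  # frames, samples per frame, codebook size (toy values)
W = 0.5 * np.roll(np.eye(FRAME), 1, axis=1)            # hypothetical linear encoder
codebook = np.vstack([np.eye(FRAME), -np.eye(FRAME)])  # hypothetical token features
# Stand-in for L_adv: a cost for placing token k at position t (assumption).
adv_cost = np.random.default_rng(0).normal(size=(T, K))

def encode(frames):
    """Map waveform frames to discrete token ids (non-differentiable argmin)."""
    feats = frames @ W
    dists = np.linalg.norm(feats[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def adv_loss(tokens):
    """Toy adversarial loss evaluated directly on token ids."""
    return float(adv_cost[np.arange(T), tokens].sum())

# Phase I (stand-in): optimize in token space, bypassing the quantizer.
target_tokens = adv_cost.argmin(axis=1)

def phase_two(ref_frames, tokens, lam=0.05, lr=0.5, steps=200):
    """Phase II (stand-in): gradient descent on the waveform so that its
    features move toward the Phase I tokens (retention term) while staying
    close to a reference signal (simplified stealth term)."""
    target_feats = codebook[tokens]
    frames = ref_frames.copy()
    for _ in range(steps):
        grad_retent = 2.0 * (frames @ W - target_feats) @ W.T
        grad_stealth = 2.0 * lam * (frames - ref_frames)
        frames -= lr * (grad_retent + grad_stealth)
    return frames

ref_frames = np.zeros((T, FRAME))  # toy reference signal (silence)
adv_frames = phase_two(ref_frames, target_tokens)

print("Phase I target tokens:", target_tokens)
print("tokens after Phase II:", encode(adv_frames))
print("toy L_adv before/after:", adv_loss(encode(ref_frames)), adv_loss(encode(adv_frames)))
```

The design point the sketch tries to capture is the split itself: the loss that depends on discrete tokens is handled in token space, and the waveform is then fitted to those tokens with a loss that is differentiable in the continuous encoder features, so no gradient has to pass through the argmin.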