Multilingual and Multi-Accent Jailbreaking of Audio LLMs

Jaechul Roh, Virat Shejwalkar, Amir Houmansadr

TL;DR

This work reveals a critical security vulnerability in Large Audio Language Models (LALMs) where multilingual and multi-accent audio inputs, when combined with acoustic perturbations, dramatically increase jailbreak success. It introduces Multi-AudioJail, a two-stage framework that (i) builds a large adversarial dataset spanning languages and accents, and (ii) employs reverberation, echo, and whisper perturbations to expose cross-lingual weaknesses. The study shows that audio-based jailbreaks are more effective than text-based ones in many cases, with jailbreak success surging by substantial margins (up to about +57 percentage points in some instances) and multimodal LLMs showing particular vulnerability. A defense-in-depth, text-based mitigation is proposed and evaluated, highlighting the urgent need for cross-modal safety training and robust defenses in multimodal LLMs, as the attack surface expands with linguistic and acoustic diversity.

Abstract

Large Audio Language Models (LALMs) have significantly advanced audio understanding but introduce critical security risks, particularly through audio jailbreaks. While prior work has focused on English-centric attacks, we expose a far more severe vulnerability: adversarial multilingual and multi-accent audio jailbreaks, where linguistic and acoustic variations dramatically amplify attack success. In this paper, we introduce Multi-AudioJail, the first systematic framework to exploit these vulnerabilities through (1) a novel dataset of adversarially perturbed multilingual/multi-accent audio jailbreaking prompts, and (2) a hierarchical evaluation pipeline revealing how acoustic perturbations (e.g., reverberation, echo, and whisper effects) interact with cross-lingual phonetics to cause jailbreak success rates (JSRs) to surge by up to +57.25 percentage points (e.g., a reverberated Kenyan-accented attack on MERaLiON). Crucially, our work further reveals that multimodal LLMs are inherently more vulnerable than unimodal systems: attackers need only exploit the weakest link (e.g., non-English audio inputs) to compromise the entire model. We show this empirically: multilingual audio-only attacks achieve 3.1x higher success rates than text-only attacks. We plan to release our dataset to spur research into cross-modal defenses, urging the community to address this expanding attack surface in multimodality as LALMs evolve.
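
The abstract reports two different kinds of comparison: an absolute gain in percentage points (+57.25) and a relative ratio (3.1x). The sketch below shows how the two differ, assuming that JSR is the percentage of prompts whose response is judged a successful jailbreak; the abstract does not spell out the definition, so this is an assumption. The function names and the toy judgement lists are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def jsr(judgements: Sequence[bool]) -> float:
    """Jailbreak success rate in percent (assumed definition):
    share of judged responses that were marked as successful jailbreaks."""
    if not judgements:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(1 for j in judgements if j) / len(judgements)


def gain_in_points(baseline: Sequence[bool], perturbed: Sequence[bool]) -> float:
    """Absolute change in JSR, in percentage points."""
    return jsr(perturbed) - jsr(baseline)


def relative_ratio(baseline: Sequence[bool], perturbed: Sequence[bool]) -> float:
    """Relative change in JSR, as a multiplicative factor."""
    base = jsr(baseline)
    if base == 0.0:
        raise ZeroDivisionError("baseline JSR is zero; ratio is undefined")
    return jsr(perturbed) / base


# Toy data (hypothetical): True means the response was judged a jailbreak.
baseline = [True, True, False, False, False, False, False, False]   # 2 of 8
perturbed = [True, True, True, True, True, True, False, False]      # 6 of 8

print(jsr(baseline), jsr(perturbed))          # 25.0 75.0
print(gain_in_points(baseline, perturbed))    # 50.0 (percentage points)
print(relative_ratio(baseline, perturbed))    # 3.0 (times the baseline)
```

In this toy case the same pair of results reads as "+50 percentage points" or as "3x", which is why the two headline numbers in the abstract are not directly comparable with each other.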

Paper Structure

This paper contains 40 sections, 1 equation, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Overview of Multi-AudioJail
  • Figure 2: Average Text-Only versus Audio-Only JSRs across languages. We demonstrate that overall audio-only inputs yield higher JSRs compared to text-only inputs.
  • Figure 3: JSRs (%) for Natural and Synthetic Multi-Accent Audio Inputs. Natural accents generally yield lower JSRs (averaging around 2.54%) than synthetic accents, which exhibit much higher JSRs (averaging around 11.42%).
  • Figure 4: Text-based Defense Prompt Template for Safe Query Handling. The prompt is translated into the corresponding language depending on the language of the audio input.
  • Figure 5: Example Generation of Qwen2-Audio with German audio input.
  • ...and 1 more figure