When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, Bodam Kim, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin

TL;DR

This research introduces WhisperInject, a two-stage adversarial audio attack framework that can manipulate state-of-the-art audio language models to generate harmful content through imperceptible perturbations in audio inputs that remain benign to human listeners.

Abstract

As large language models (LLMs) become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that manipulates state-of-the-art audio language models to generate harmful content. Our method embeds harmful payloads as subtle perturbations into audio inputs that remain intelligible to human listeners. The first stage uses a novel reward-based white-box optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to jailbreak the target model and elicit harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, where we use gradient-based optimization to embed subtle perturbations into benign audio carriers, such as weather queries or greeting messages. Our method achieves average attack success rates of 60-78% across two benchmarks and five multimodal LLMs, validated by multiple evaluation frameworks. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating multimodal AI systems.

Paper Structure

This paper contains 40 sections, 8 equations, 4 figures, 5 tables, and 1 algorithm.

Figures (4)

  • Figure 1: A conceptual illustration of our attack scenario. An adversary embeds a hidden command in a viral video; when a victim plays the video, the command compromises the victim's nearby IoT devices.
  • Figure 2: Overview of the WhisperInject attack. Left (Stage 1): Native Target Discovery via RL-PGD. The model's response evolves from full refusal (Reward: 1) through softened stance (Reward: 5) to successful jailbreak (Reward: 10). Right (Stage 2): Payload Injection. The discovered native response is embedded as a subtle perturbation into benign audio. A human hears "How's the weather today?" with minimal distortion, while the ALM outputs the malicious content.
  • Figure 3: Conceptual comparison of Standard PGD vs. our RL-PGD. Standard PGD follows a single gradient towards a fixed point. Our RL-PGD adaptively explores multiple paths, using rewards to compute a weighted search direction towards a broader target region.
  • Figure 4: Visual analysis of adversarial audio in the time and frequency domains. The perturbation is minimally invasive in the waveform (a) but is structured and distributed across the Mel spectrogram (b), ensuring stealth while effectively manipulating the model.
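
The contrast drawn in Figure 3 can be illustrated with a toy numerical sketch. It is not the paper's RL-PGD algorithm: it involves no audio and no model, and the quadratic loss, the softmax weighting of rewards, the candidate points, and every function name here are assumptions made for illustration. It shows only the two generic ingredients the captions name: a projected gradient step that keeps a perturbation inside a small L-infinity budget, and a search direction formed as a reward-weighted average over several candidates rather than a single fixed target.

```python
import math


def project_linf(delta, eps):
    """Clip every coordinate of the perturbation into [-eps, eps]."""
    return [max(-eps, min(eps, d)) for d in delta]


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def pgd_step(delta, grad, step, eps):
    """One signed-gradient descent step followed by projection (standard PGD)."""
    moved = [d - step * sign(g) for d, g in zip(delta, grad)]
    return project_linf(moved, eps)


def reward_weights(rewards, temperature=1.0):
    """Softmax over rewards (an assumed weighting, not taken from the paper)."""
    top = max(rewards)
    exps = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_direction(grads, rewards, temperature=1.0):
    """Combine per-candidate gradients into one reward-weighted direction."""
    weights = reward_weights(rewards, temperature)
    dim = len(grads[0])
    return [sum(w * g[i] for w, g in zip(weights, grads)) for i in range(dim)]


def run_toy(steps=20, step=0.01, eps=0.05):
    """Toy loop: each candidate c has loss 0.5 * ||delta - c||^2."""
    candidates = [[0.3, -0.2], [0.25, -0.1], [0.2, -0.3]]  # hypothetical points
    rewards = [1.0, 5.0, 10.0]  # hypothetical scores on a 1-10 scale
    delta = [0.0, 0.0]
    for _ in range(steps):
        # Gradient of the quadratic toy loss with respect to delta.
        grads = [[d - c for d, c in zip(delta, cand)] for cand in candidates]
        direction = weighted_direction(grads, rewards)
        delta = pgd_step(delta, direction, step, eps)
    return delta
```

In this toy setting the perturbation moves toward the highest-reward candidate but is held at the boundary of the budget `eps`, which is the role the projection plays in any PGD-style method.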