Rebellion: Noise-Robust Reasoning Training for Audio Reasoning Models

Tiansheng Huang, Virat Shejwalkar, Oscar Chang, Milad Nasr, Ling Liu

TL;DR

The study addresses the safety of audio reasoning models (ARMs) trained with reasoning training (RT). It shows that standard RT fails against advanced audio jailbreaks because these jailbreaks cause the model's representations to drift away from those of vanilla harmful queries. It introduces Rebellion, a robust RT method that optimizes a min–max objective to withstand worst-case representation drift, validated on Qwen2-Audio-7B. Empirically, Rebellion improves the safety-accuracy trade-off over standard RT: it reduces harmful responses to jailbreaks such as AdvWave and Rephrasing while preserving benign reasoning performance, and it exhibits a "think twice" safety behavior under heavy noise. The results suggest a path toward more reliable audio-based reasoning under adversarial inputs.

Abstract

Instilling reasoning capabilities in large models (LMs) using reasoning training (RT) significantly improves LMs' performance. Thus, Audio Reasoning Models (ARMs), i.e., audio LMs that can reason, are becoming increasingly popular. However, no work has studied the safety of ARMs against jailbreak attacks that aim to elicit harmful responses from target models. To this end, we first show that standard RT with appropriate safety reasoning data can protect ARMs from vanilla audio jailbreaks, but cannot protect them against our proposed simple yet effective jailbreaks. We show that this is because of the significant representation drift between vanilla and advanced jailbreaks, which forces the target ARMs to emit harmful responses. Based on this observation, we propose Rebellion, a robust RT that trains ARMs to be robust to the worst-case representation drift. All our results are on Qwen2-Audio; they demonstrate that Rebellion 1) can protect against advanced audio jailbreaks without compromising performance on benign tasks, and 2) significantly improves the accuracy-safety trade-off over the standard RT method.
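
As a reading aid, training against worst-case representation drift can be written in the generic min-max form below. This is only a sketch of the general shape of such an objective, not the paper's exact formulation: the symbols \theta (model parameters), h_\theta(x) (the internal representation of input x), \delta (a drift applied to that representation), \rho (a bound on the drift), and \mathcal{L} (the RT loss on the safety reasoning target y) are all assumptions introduced here for illustration.

```latex
\min_{\theta} \; \mathbb{E}_{(x, y)} \left[ \max_{\|\delta\| \le \rho} \; \mathcal{L}\big(\theta;\, h_{\theta}(x) + \delta,\, y\big) \right]
```

The inner maximization searches for the drift that hurts the loss most, and the outer minimization trains the model to produce the safe reasoning target even under that drift.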

Paper Structure

This paper contains 12 sections, 4 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: RT and Rebellion against a vanilla harmful query and an audio jailbreak query (AdvWave). Audio reasoning models (ARMs) trained using either standard reasoning training (RT) or Rebellion correctly refuse vanilla harmful questions. However, RT complies with audio jailbreaks optimized to circumvent safety guardrails and fails to provide safety reasoning. In contrast, Rebellion exhibits a "think twice" phenomenon: it starts its response by implying compliance with the jailbreak question, but then provides the correct safety reasoning and ultimately refuses the question. See the results section for more discussion.
  • Figure 2: Illustration of representation drift under AdvWave. Each red (blue) point represents the last token's representation of a vanilla harmful (jailbreak) query. The jailbreak prompts (which contain the original harmful prompt plus audio noise) incur representation drift compared to the original harmful prompts.
  • Figure 3: Qualitative example explaining the "think twice" behavior of Rebellion when it encounters a longer-suffix audio jailbreak.
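
The drift illustrated in Figure 2 could be quantified in several ways; one simple option is the distance between the mean last-token representations of the two query sets. The sketch below assumes that metric, which is not stated in the source, and uses tiny hand-made vectors as hypothetical stand-ins for real hidden states. The function names `centroid` and `centroid_drift` are illustrative only.

```python
import math


def centroid(points):
    """Return the element-wise mean of a list of equal-length vectors."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[i] for p in points) / n for i in range(dim)]


def centroid_drift(vanilla_reps, jailbreak_reps):
    """Euclidean distance between the centroids of two representation sets.

    Assumption: each row is the last-token representation of one query.
    """
    c_vanilla = centroid(vanilla_reps)
    c_jailbreak = centroid(jailbreak_reps)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_jailbreak, c_vanilla)))


# Hypothetical 2-D stand-ins for last-token representations.
vanilla = [[0.0, 0.0], [2.0, 0.0]]    # centroid (1, 0)
jailbreak = [[3.0, 4.0], [5.0, 4.0]]  # centroid (4, 4)

drift = centroid_drift(vanilla, jailbreak)
print(drift)  # 5.0
```

A larger value means the jailbreak queries sit further from the vanilla harmful queries in representation space. With real models, the rows would be hidden states extracted from the ARM rather than hand-written vectors.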