Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models

Zonghao Ying, Deyue Zhang, Zonglei Jing, Yisong Xiao, Quanchen Zou, Aishan Liu, Siyuan Liang, Xiangzheng Zhang, Xianglong Liu, Dacheng Tao

TL;DR

This work studies safety vulnerabilities of LLMs under multi-turn jailbreaking. It proposes Reasoning-Augmented Conversation (RACE), which reframes harmful prompts as benign reasoning tasks within a structured Attack State Machine (ASM). RACE adds gain-guided exploration, self-play, and rejection feedback modules to preserve attack semantics and sustain attack progression, achieving state-of-the-art attack success rates (ASRs) across multiple models. On AdvBench and HarmBench, RACE reaches ASRs up to 96% and reveals significant gaps in current safety alignment, especially for reasoning-enabled models. The study provides a rigorous framework for evaluating LLM safety and offers insights for designing robust defenses against reasoning-driven jailbreaks.

Abstract

Multi-turn jailbreak attacks simulate real-world human interactions by engaging large language models (LLMs) in iterative dialogues, exposing critical safety vulnerabilities. However, existing methods often struggle to balance semantic coherence with attack effectiveness, resulting in either benign semantic drift or ineffective detection evasion. To address this challenge, we propose Reasoning-Augmented Conversation (RACE), a novel multi-turn jailbreak framework that reformulates harmful queries into benign reasoning tasks and leverages LLMs' strong reasoning capabilities to compromise safety alignment. Specifically, we introduce an attack state machine (ASM) framework to systematically model problem translation and iterative reasoning, ensuring coherent query generation across multiple turns. Building on this framework, we design gain-guided exploration, self-play, and rejection feedback modules to preserve attack semantics, enhance effectiveness, and sustain reasoning-driven attack progression. Extensive experiments on multiple LLMs demonstrate that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios, with attack success rates (ASRs) increasing by up to 96%. Notably, our approach achieves ASRs of 82% and 92% against leading commercial models, OpenAI o1 and DeepSeek R1, underscoring its potency. We release our code at https://github.com/NY1024/RACE to facilitate further research in this critical domain.

Paper Structure

This paper contains 36 sections, 7 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Illustration of RACE. RACE transforms the harmful query into a benign reasoning task and processes it over subsequent conversation turns. During this process, the LLM gradually engages in step-by-step reasoning, ultimately leading to self-jailbreak.
  • Figure 2: Overall attack process and framework. RACE achieves a jailbreak by transforming the target query into a reasoning task and conducting multi-turn reasoning. The entire attack process is modeled as an ASM and optimized using the three proposed modules.
  • Figure 3: ASR (%) of different attacks against leading commercial reasoning LLMs.
  • Figure 4: Attack performance under different numbers of conversation turns.
  • Figure 5: Impact of different reasoning types.
  • ...and 6 more figures