RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models

Xiqiao Xiong; Ouxiang Li; Zhuo Liu; Moxin Li; Wentao Shi; Fuli Feng; Xiangnan He

TL;DR

This work tackles the vulnerability of black-box LLMs to multi-turn jailbreaks by reframing attacker training as a trajectory-level reinforcement learning problem. It introduces RL-MTJail, which adds two heuristic process rewards, over-harm mitigation and target-guided progression, to mitigate sparse supervision and promote long-horizon strategies. Empirical results on HarmBench, StrongREJECT, and JailbreakBench show substantial improvements in attack success rates and robust transferability across victim models, and ablations confirm the value of the process rewards. The approach advances understanding of jailbreak dynamics and offers a framework for developing and evaluating defenses against multi-turn, trajectory-based attacks in real-world LLM deployments.

Abstract

Large language models are vulnerable to jailbreak attacks, threatening their safe deployment in real-world applications. This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models through a sequence of prompt-output interactions. Existing approaches typically rely on single-turn optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate the problem as a multi-turn reinforcement learning task, directly optimizing the harmfulness of the final-turn output as the outcome reward. To mitigate sparse supervision and promote long-term attack strategies, we propose two heuristic process rewards: (1) controlling the harmfulness of intermediate outputs to prevent triggering the black-box model's rejection mechanisms, and (2) maintaining the semantic relevance of intermediate outputs to avoid drifting into irrelevant content. Experimental results on multiple benchmarks show consistently improved attack success rates across multiple models, highlighting the effectiveness of our approach. The code is available at https://github.com/xxiqiao/RL-MTJail. Warning: This paper contains examples of harmful content.
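
The reward structure described in the abstract (a final-turn outcome reward plus two process rewards over intermediate turns) can be illustrated with a minimal sketch of the scalar arithmetic only. This is not the authors' formulation: the `Turn` record, the `[0, 1]` score scales, the `harm_cap` threshold, the weights `w_harm` and `w_rel`, and the simple averaging are all hypothetical choices made for illustration, and the scores themselves are assumed to come from some external judge that is not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """Scores for one victim output (hypothetical record, not from the paper)."""
    harm: float       # judged harmfulness in [0, 1] (assumed scale)
    relevance: float  # semantic relevance to the target in [0, 1] (assumed scale)


def trajectory_reward(
    turns: List[Turn],
    harm_cap: float = 0.5,  # assumed threshold for "too harmful too early"
    w_harm: float = 0.5,    # assumed weight of the over-harm penalty
    w_rel: float = 0.5,     # assumed weight of the relevance bonus
) -> float:
    """Combine an outcome reward with two process rewards (illustrative only)."""
    if not turns:
        raise ValueError("a trajectory needs at least one turn")

    *intermediate, final = turns

    # Outcome reward: harmfulness of the final-turn output.
    outcome = final.harm
    if not intermediate:
        return outcome

    n = len(intermediate)
    # Process reward 1: penalize intermediate outputs whose harmfulness
    # exceeds the cap, since they may trigger the victim's refusals.
    over_harm = sum(max(0.0, t.harm - harm_cap) for t in intermediate) / n
    # Process reward 2: reward intermediate outputs that stay on topic.
    relevance = sum(t.relevance for t in intermediate) / n

    return outcome - w_harm * over_harm + w_rel * relevance
```

With these assumed defaults, a single-turn trajectory reduces to the outcome reward alone, while longer trajectories are additionally shaped by the two intermediate-turn terms, which is the sense in which process rewards densify an otherwise sparse signal.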
