TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

Sung-Hoon Yoon, Ruizhi Qian, Minda Zhao, Weiyue Li, Mengyu Wang

TL;DR

TrailBlazer treats LLM jailbreaking as a history-aware sequential decision problem. It introduces History-augmented Reinforcement Learning (HRL) and Attention-based HRL (AHRL), which explicitly use past prompts, responses, rewards, and actions; in AHRL, an attention mechanism highlights the most informative earlier steps. On AdvBench and HarmBench, TrailBlazer achieves state-of-the-art Attack Success Rate with substantially lower Queries Per Success, so it is both more effective and more query-efficient. The framework generalizes across diverse open LLMs and shows solid transferability, underscoring the importance of historical vulnerability signals for reinforcement-learning-driven adversarial evaluation and for guiding the design of safeguards.

Abstract

Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL) based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences future actions, reinforcement learning provides a natural framework for this problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this insight, we introduce an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries. Extensive experiments on AdvBench and HarmBench demonstrate that our method achieves state-of-the-art jailbreak performance while significantly improving query efficiency. These results underscore the importance of historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.

Paper Structure (31 sections, 4 equations, 2 figures, 7 tables)

This paper contains 31 sections, 4 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Overall framework of TrailBlazer. At iteration $t$, the RL agent $\pi$ observes a history-aware state $\hat{s}^{(t)}$ formed from the current prompt embedding $\phi(p^{(t)})$ and a summary of recent interactions $\tilde{h}^{(t)}$. The agent selects a discrete mutator $a^{(t)} \in \{\text{Crossover}, \text{Expand}, \text{Rephrase}, \text{Generate}, \text{Shorten}\}$, which a helper LLM executes to update the template $m^{(t)} \rightarrow m^{(t+1)}$. The updated template $m^{(t+1)}$, combined with the fixed query $q$, yields the next prompt, which elicits response $u^{(t)}$ from the target LLM. The response is scored to produce a reward $r^{(t)}$ and response features $y^{(t)}$ (refusal, perplexity, length, toxicity), and the step is stored as $h^{(t)}=[\phi(p^{(t)}),a^{(t)},y^{(t)},r^{(t)}]$. Attention over the past records $\{h^{(t-i)}\}$, queried by $\phi(p^{(t)})$, reweights prior steps to guide subsequent actions.
  • Figure 2: Value loss during PPO training for Qwen3-14B (dark blue) and GPT-oss-20B (sky blue). Both models show a fast initial decrease followed by stable convergence, indicating that the value function is learned successfully.
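
The state construction described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the scaled dot-product form of the attention, the projection matrices `W_q` and `W_k`, the one-hot action encoding, the use of raw records as attention values, and all dimensions are assumptions chosen for illustration. Only the record layout $h=[\phi(p),a,y,r]$, the mutator names, and the use of $\phi(p^{(t)})$ as the query come from the caption.

```python
import numpy as np

# Mutator names as listed in the Figure 1 caption.
MUTATORS = ["Crossover", "Expand", "Rephrase", "Generate", "Shorten"]


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def history_record(prompt_emb, action_idx, features, reward):
    """Build one record h = [phi(p), a, y, r].

    Assumption: the action is one-hot encoded; `features` stands in for
    the response features y (refusal, perplexity, length, toxicity).
    """
    one_hot = np.zeros(len(MUTATORS))
    one_hot[action_idx] = 1.0
    return np.concatenate([prompt_emb, one_hot, features, [reward]])


def attend_history(query_emb, history, W_q, W_k):
    """Reweight past records with attention queried by phi(p).

    Assumption: scaled dot-product attention with hypothetical
    projections W_q and W_k; the raw records serve as values.
    Returns (summary, weights).
    """
    record_dim = W_k.shape[1]
    if len(history) == 0:
        # No past interactions yet: use an all-zero summary.
        return np.zeros(record_dim), np.zeros(0)
    H = np.stack(history)                    # (n, record_dim)
    q = W_q @ query_emb                      # (d_k,)
    K = H @ W_k.T                            # (n, d_k)
    weights = softmax(K @ q / np.sqrt(q.shape[0]))
    return weights @ H, weights


def build_state(query_emb, history, W_q, W_k):
    """History-aware state: current embedding concatenated with the summary."""
    summary, weights = attend_history(query_emb, history, W_q, W_k)
    return np.concatenate([query_emb, summary]), weights


# Toy usage with random placeholders instead of real embeddings and scores.
rng = np.random.default_rng(0)
d, d_k, n_feat = 8, 4, 4
record_dim = d + len(MUTATORS) + n_feat + 1
W_q = rng.normal(size=(d_k, d))
W_k = rng.normal(size=(d_k, record_dim))
history = [
    history_record(rng.normal(size=d), int(rng.integers(len(MUTATORS))),
                   rng.random(n_feat), float(rng.random()))
    for _ in range(3)
]
state, weights = build_state(rng.normal(size=d), history, W_q, W_k)
print(state.shape, weights.round(3))
```

In this sketch the policy would receive `state` as its observation and choose an index into `MUTATORS`; the attention `weights` show which past steps dominate the summary. A learned encoder and trained projections would replace the random placeholders in practice.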