RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs

Xuan Chen, Yuzhou Nie, Lu Yan, Yunshu Mao, Wenbo Guo, Xiangyu Zhang

TL;DR

This work tackles the problem of safety-alignment bypass in LLMs by introducing RL-JACK, a deep reinforcement learning framework that treats jailbreaking prompt generation as a deterministic search in a black-box setting. A novel RL agent selects among ten jailbreaking strategies, guided by a helper LLM, to produce diverse and effective prompts; a dense cosine-similarity reward against an unaligned reference provides continuous feedback, while a tailored state-transition scheme maintains stability. PPO-based training yields policies that outperform state-of-the-art automatic and in-context methods across six LLMs and exhibit resilience to defenses and transferability between models. The study also presents thorough ablations and sensitivity analyses, strengthening the case for RL-JACK’s design choices and highlighting ethical considerations and defense-oriented implications. Overall, RL-JACK demonstrates that DRL can effectively generate jailbreaking prompts, informing the development of more robust safety alignments for both open-source and commercial LLMs.

Abstract

Modern large language model (LLM) developers typically conduct safety alignment to prevent an LLM from generating unethical or harmful content. Recent studies have discovered that the safety alignment of LLMs can be bypassed by jailbreaking prompts. These prompts are designed to create specific conversation scenarios with a harmful question embedded. Querying an LLM with such prompts can mislead the model into responding to the harmful question. The stochastic nature of existing genetic methods largely limits the effectiveness and efficiency of state-of-the-art (SOTA) jailbreaking attacks. In this paper, we propose RL-JACK, a novel black-box jailbreaking attack powered by deep reinforcement learning (DRL). We formulate the generation of jailbreaking prompts as a search problem and design a novel RL approach to solve it. Our method includes a series of customized designs to enhance the RL agent's learning efficiency in the jailbreaking context. Notably, we devise an LLM-facilitated action space that enables diverse action variations while constraining the overall search space. We propose a novel reward function that provides meaningful dense rewards for the agent toward achieving successful jailbreaking. Through extensive evaluations, we demonstrate that RL-JACK is overall much more effective than existing jailbreaking attacks against six SOTA LLMs, including large open-source models and commercial models. We also show RL-JACK's resiliency against three SOTA defenses and its transferability across different models. Finally, we validate the insensitivity of RL-JACK to variations in key hyperparameters.
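
The TL;DR describes the dense reward as a cosine similarity against a reference from an unaligned model. The sketch below illustrates only that similarity computation on two embedding vectors. It is not the authors' implementation: the names `cosine_similarity` and `dense_reward` are hypothetical, and the choice of text encoder that would produce the vectors is left open because this summary does not specify it.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector has zero norm, to avoid dividing by zero.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def dense_reward(response_embedding: Sequence[float],
                 reference_embedding: Sequence[float]) -> float:
    """Hypothetical reward: similarity between a response and a reference.

    Assumption: both inputs are embeddings from the same text encoder.
    The value lies in [-1, 1], so every step yields a graded signal
    rather than a sparse success/failure flag.
    """
    return cosine_similarity(response_embedding, reference_embedding)
```

Because the value varies continuously with the embeddings, an agent receives feedback on every step instead of only when an attempt fully succeeds, which is what makes the reward dense.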

Paper Structure (36 sections, 6 equations, 7 figures, 11 tables, 2 algorithms)

Figures (7)

  • Figure 1: Deterministic vs. stochastic search in a grid search problem. Here we assume the initial point is the block in the bottom left corner and the goal is to reach the black block on the top right corner following a certain strategy. The deterministic search moves towards the target following a fixed direction (for example given by the gradient), while the stochastic search jumps across different sub-regions.
  • Figure 2: Overview of RL-JACK. The texts in yellow and blue canvas represent our RL agent's state and action, respectively. The texts in grey and light red canvas represent our generated jailbreaking prompt and the target model's response to the prompt.
  • Figure 3: Demonstration of our state transition design. The agent selects the 5-th action at $t-1$ and the 0-th action at $t$. Without the crossover, the two consecutive states can be very different ($\mathbf{p}^{(t-1)}$ vs. $\mathbf{p}^{(t)'}$). The state transition becomes much smoother after the crossover ($\mathbf{p}^{(t-1)}$ vs. $\mathbf{p}^{(t)}$).
  • Figure 4: Attack performance of RL-JACK when varying $\tau$.
  • Figure 5: Agent architecture. The snowflake indicates that part of the model is frozen during the agent training.
  • ...and 2 more figures
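
The contrast drawn in the Figure 1 caption can be illustrated with a toy grid. The sketch below is a minimal illustration, not code from the paper: the function names are hypothetical, the deterministic searcher is assumed to move one cell (possibly diagonally) toward the goal per step, and the stochastic searcher is assumed to jump to a uniformly random cell each step.

```python
import random
from typing import Tuple

Cell = Tuple[int, int]


def deterministic_search(start: Cell, goal: Cell) -> int:
    """Count steps when moving one cell toward the goal per move.

    The direction is fixed by the sign of the remaining offset on each
    axis, playing the role of a gradient-like signal.
    """
    x, y = start
    steps = 0
    while (x, y) != goal:
        x += (goal[0] > x) - (goal[0] < x)  # -1, 0, or +1
        y += (goal[1] > y) - (goal[1] < y)
        steps += 1
    return steps


def stochastic_search(size: int, goal: Cell, rng: random.Random,
                      max_steps: int = 10_000) -> int:
    """Count jumps to uniformly random cells until the goal is hit.

    Returns max_steps if the goal is never sampled within the budget.
    """
    for steps in range(1, max_steps + 1):
        cell = (rng.randrange(size), rng.randrange(size))
        if cell == goal:
            return steps
    return max_steps


if __name__ == "__main__":
    size = 5
    goal = (size - 1, size - 1)
    print("deterministic steps:", deterministic_search((0, 0), goal))
    print("stochastic steps:", stochastic_search(size, goal, random.Random(0)))
```

On a grid of side `n`, the deterministic walk from one corner to the opposite corner takes `n - 1` steps under these assumptions, while the random jumps need about `n * n` samples in expectation, which reflects the efficiency argument the caption makes.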