xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking
Sunbowen Lee, Shiwen Ni, Chi Wei, Shuaimin Li, Liyang Fan, Ahmadreza Argha, Hamid Alinejad-Rokny, Ruifeng Xu, Yicheng Gong, Min Yang
TL;DR
The paper addresses the vulnerability of safety-aligned LLMs to black-box jailbreaking and proposes xJailbreak, a reinforcement-learning method that uses proximity in representation space to guide prompt rewriting while preserving the original intent. It formulates the task as an MDP and trains a PPO agent over 10 rewriting templates, with a reward that combines a Borderline Score $r_d$ and an Intent Score $r_i$, balanced by a weight $\alpha$ and discounted by a factor $\gamma$. The representation-guided reward and an explicit task pipeline make the resulting jailbreaks interpretable and semantically faithful. The method achieves state-of-the-art jailbreak performance on several open- and closed-source models, including Qwen2.5-7B-Instruct, Llama3.1-8B-Instruct, and GPT-4o-0806, and ablations validate the contribution of each component. The paper also provides an evaluation framework built on keyword checks, validity judgments, and intent-detection metrics, which can inform both attack strategies and defenses. Overall, xJailbreak offers a principled, interpretable, and effective way to study and strengthen LLM safety against black-box jailbreaks.
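As a rough illustration of how the two scores might be combined, the sketch below assumes a convex combination weighted by $\alpha$ together with the standard discounted return from reinforcement learning. The exact functional form is an assumption made here, not taken from the paper, and the symbols $r_t$, $G_t$, and $T$ are introduced only for this sketch.

$$
r_t = \alpha\, r_d + (1 - \alpha)\, r_i, \qquad G_t = \sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k}, \qquad \alpha \in [0, 1],\ \gamma \in [0, 1].
$$

Under this assumed form, $\alpha = 1$ would count only the Borderline Score and $\alpha = 0$ only the Intent Score, while a smaller $\gamma$ would weight immediate rewrites more heavily than later ones.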
Abstract
Safety alignment mechanisms are essential for preventing large language models (LLMs) from generating harmful information or unethical content. However, cleverly crafted prompts can bypass these safety measures without accessing the model's internal parameters, a phenomenon known as black-box jailbreak. Existing heuristic black-box attack methods, such as genetic algorithms, suffer from limited effectiveness due to their inherent randomness, while recent reinforcement learning (RL) based methods often lack robust and informative reward signals. To address these challenges, we propose a novel black-box jailbreak method leveraging RL, which optimizes prompt generation by analyzing the embedding proximity between benign and malicious prompts. This approach ensures that the rewritten prompts closely align with the intent of the original prompts while enhancing the attack's effectiveness. Furthermore, we introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success. Experimental results show the superiority of our approach, achieving state-of-the-art (SOTA) performance on several prominent open- and closed-source LLMs, including Qwen2.5-7B-Instruct, Llama3.1-8B-Instruct, and GPT-4o-0806. Our method sets a new benchmark in jailbreak attack effectiveness, highlighting potential vulnerabilities in LLMs. The codebase for this work is available at https://github.com/Aegis1863/xJailbreak.
