Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective
Tianlong Li, Zhenghua Wang, Wenhao Liu, Muling Wu, Shihan Dou, Changze Lv, Xiaohua Wang, Xiaoqing Zheng, Xuanjing Huang
TL;DR
This work proposes a representation-engineering lens to explain LLM jailbreaking via internal "safety patterns": activation patterns in the hidden-state space that govern self-safeguarding behavior. By constructing JailEval and deriving contrastive patterns, the authors identify layer-wise features whose manipulation can weaken or strengthen safety responses with minimal impact on fluency and general abilities. They demonstrate that weakening these patterns increases jailbreak success across multiple datasets and models, while strengthening them can defend against stealthy prompts. The findings offer a mechanistic, low-cost approach to understanding and mitigating jailbreaking, with direct implications for safeguarding open-source LLMs.
Abstract
The recent surge in jailbreaking attacks has revealed significant vulnerabilities in Large Language Models (LLMs) when exposed to malicious inputs. While various defense strategies have been proposed to mitigate these threats, there has been limited research into the underlying mechanisms that make LLMs vulnerable to such attacks. In this study, we suggest that the self-safeguarding capability of LLMs is linked to specific activity patterns within their representation space. Although these patterns have little impact on the semantic content of the generated text, they play a crucial role in shaping LLM behavior under jailbreaking attacks. Our findings demonstrate that these patterns can be detected with just a few pairs of contrastive queries. Extensive experimentation shows that the robustness of LLMs against jailbreaking can be manipulated by weakening or strengthening these patterns. Further visual analysis supports our conclusions and offers new insights into the jailbreaking phenomenon. These findings highlight the importance of addressing the potential misuse of open-source LLMs within the community.
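As a rough illustration of the idea in the abstract, the sketch below derives a pattern from a few pairs of contrastive queries and then weakens or strengthens it in a hidden state. The abstract does not specify the extraction procedure, so everything here is an assumption: the pattern is taken to be the per-dimension mean difference between paired hidden states, the intervention is a simple additive shift scaled by a coefficient `alpha`, and the names `contrastive_pattern` and `steer` as well as the toy vectors are hypothetical.

```python
from statistics import mean


def contrastive_pattern(malicious_states, benign_states):
    """Per-dimension mean difference between paired hidden states.

    Assumption: each malicious_states[i] is paired with benign_states[i],
    and all vectors share the same length.
    """
    dims = len(malicious_states[0])
    return [
        mean(m[d] - b[d] for m, b in zip(malicious_states, benign_states))
        for d in range(dims)
    ]


def steer(hidden, pattern, alpha):
    """Shift a hidden state along the pattern.

    Assumption: alpha < 0 weakens the pattern, alpha > 0 strengthens it.
    """
    return [h + alpha * p for h, p in zip(hidden, pattern)]


# Toy hidden states standing in for one layer's activations (made-up values).
malicious = [[1.0, 0.5, 2.0], [1.5, 0.25, 1.5]]
benign = [[0.0, 0.5, 1.0], [0.5, 0.75, 1.0]]

pattern = contrastive_pattern(malicious, benign)
weakened = steer([2.0, 1.0, 1.0], pattern, alpha=-1.0)
strengthened = steer([2.0, 1.0, 1.0], pattern, alpha=1.0)
```

In a real model the vectors would come from a chosen layer's hidden states rather than hand-written lists, and the choice of layers and scaling would need to be tuned empirically.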
