Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective

Tianlong Li; Zhenghua Wang; Wenhao Liu; Muling Wu; Shihan Dou; Changze Lv; Xiaohua Wang; Xiaoqing Zheng; Xuanjing Huang

TL;DR

This work proposes a representation-engineering lens to explain LLM jailbreaking via internal 'safety patterns': activation patterns in the hidden-state space that govern self-safeguard behavior. By constructing JailEval and deriving contrastive patterns, the authors identify layer-wise features whose manipulation can weaken or strengthen safety responses with minimal impact on fluency and general abilities. They demonstrate that weakening these patterns increases jailbreak success across multiple datasets and models, while strengthening them can defend against stealthy prompts. The findings offer a mechanistic, low-cost approach to understanding and mitigating jailbreaking, with important implications for safeguarding open-source LLMs.

Abstract

The recent surge in jailbreaking attacks has revealed significant vulnerabilities in Large Language Models (LLMs) when exposed to malicious inputs. While various defense strategies have been proposed to mitigate these threats, there has been limited research into the underlying mechanisms that make LLMs vulnerable to such attacks. In this study, we suggest that the self-safeguarding capability of LLMs is linked to specific activity patterns within their representation space. Although these patterns have little impact on the semantic content of the generated text, they play a crucial role in shaping LLM behavior under jailbreaking attacks. Our findings demonstrate that these patterns can be detected with just a few pairs of contrastive queries. Extensive experimentation shows that the robustness of LLMs against jailbreaking can be manipulated by weakening or strengthening these patterns. Further visual analysis provides additional evidence for our conclusions and offers new insights into the jailbreaking phenomenon. These findings highlight the importance of addressing the potential misuse of open-source LLMs within the community.
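
The idea of deriving a pattern from contrastive query pairs and then weakening or strengthening it can be illustrated with a toy sketch. This is not the authors' implementation: the function names `extract_safety_pattern` and `shift_hidden_state`, the `keep_ratio` and `beta` parameters, and the rule of keeping the dimensions whose differences vary least across pairs are all assumptions made for illustration. Real hidden states would come from one layer of a model; here they are plain Python lists.

```python
from statistics import fmean, pvariance


def extract_safety_pattern(malicious, benign, keep_ratio=0.3):
    """Build one pattern vector from paired hidden states of a single layer.

    malicious, benign: lists of equal-length float vectors, paired by index.
    keep_ratio: hypothetical fraction of dimensions to keep (an assumption).
    """
    # Contrastive patterns: one difference vector per query pair.
    diffs = [[m - b for m, b in zip(mv, bv)] for mv, bv in zip(malicious, benign)]
    # Transpose so each entry holds one dimension's values across all pairs.
    columns = list(zip(*diffs))
    k = max(1, round(keep_ratio * len(columns)))
    # Assumed selection rule: keep the k dimensions whose difference is most
    # consistent across pairs (lowest variance).
    ranked = sorted(range(len(columns)), key=lambda j: pvariance(columns[j]))
    keep = set(ranked[:k])
    # Kept dimensions take the mean difference; all others are zeroed.
    return [fmean(col) if j in keep else 0.0 for j, col in enumerate(columns)]


def shift_hidden_state(hidden, pattern, beta):
    """Add beta * pattern to a hidden state.

    A negative beta weakens the pattern; a positive beta strengthens it.
    """
    return [h + beta * p for h, p in zip(hidden, pattern)]
```

In this sketch a single signed coefficient covers both directions, so the same pattern could in principle be subtracted to weaken the safeguard or added to reinforce it. How the paper chooses dimensions and scaling per layer is described in its Method section, not here.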

Paper Structure (20 sections, 10 equations, 11 figures, 15 tables)

Introduction
Related Work
LLM Jailbreak
Representation Engineering
Method
Extracting Contrastive Patterns
Feature Localization
Pattern Construction
Experimental Setting
Experimental Results and Analysis
Main Result
Visualization Analysis
Ablation Study
Sensitivity Analysis
Conclusion
...and 5 more sections

Figures (11)

Figure 1: Illustrative examples of successful jailbreaks when the model's safety patterns are weakened. See Appendix: Hyperparameter Used In Experiments for more cases on different topics.
Figure 2: Illustration of our work (taking Llama as an example). Extracting Safety Patterns: After obtaining the representation differences (Contrastive Patterns) of the query pairs, we calculate the LLM's Safety Patterns from them. Jailbreak Attack with Safety Patterns: Weakening the model's safety patterns in the latent space of each layer's output reduces its ability to refuse malicious instructions.
Figure 3: FR heatmaps of four LLMs on Sorry Bench. "$-$SP" indicates that the safety patterns have been weakened. Weakening the safety patterns reduces the LLMs' self-safeguard ability across various malicious topics, demonstrating the general applicability of safety patterns.
Figure 4: The visualization results of Layer-1 activity patterns (on Llama2-7b-chat). For the visualization of other layers of the model, other models, and other jailbreaking methods, please refer to Appendix: More Visualization Results.
Figure 5: The ASR-3 and PPL (mean and standard deviation) on AdvBench*. The figures show two types of PPL anomalies: Llama2-7b-chat has a very low mean and standard deviation of PPL due to repetitive single-word outputs, while Llama2-13b-chat shows a significant increase in both the mean and standard deviation of PPL due to garbled outputs (refer to the table in Appendix: abnormal_cases for detailed cases).
...and 6 more figures
