JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit

Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Wenhui Zhang, Qinglong Wang, Rui Zheng

Abstract

Despite the outstanding performance of Large Language Models (LLMs) in diverse tasks, they are vulnerable to jailbreak attacks, wherein adversarial prompts are crafted to bypass their security mechanisms and elicit unexpected responses. Although jailbreak attacks are prevalent, the understanding of their underlying mechanisms remains limited. Recent studies have explained typical jailbreaking behavior of LLMs (e.g., the degree to which the model refuses to respond) either by analyzing the representation shifts that jailbreak prompts cause in the latent space or by identifying key neurons that contribute to the success of jailbreak attacks. However, these studies neither explore diverse jailbreak patterns nor provide a fine-grained explanation that connects circuit failures to representational changes, leaving significant gaps in uncovering the jailbreak mechanism. In this paper, we propose JailbreakLens, an interpretation framework that analyzes jailbreak mechanisms from both the representation perspective (which reveals how jailbreaks alter the model's harmfulness perception) and the circuit perspective (which uncovers the causes of these deceptions by identifying key circuits contributing to the vulnerability), tracking their evolution throughout the entire response generation process. We then conduct an in-depth evaluation of jailbreak behavior on five mainstream LLMs under seven jailbreak strategies. Our evaluation reveals that jailbreak prompts amplify components that reinforce affirmative responses while suppressing those that produce refusal. This manipulation shifts model representations toward safe clusters to deceive the LLM, leading it to provide detailed responses instead of refusals. Notably, we find a strong and consistent correlation between representation deception and activation shift of key circuits across diverse jailbreak methods and multiple LLMs.
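
To make the representation perspective concrete, the following is a minimal sketch of a linear probe that assigns a safety score to a hidden representation. It is an illustration under stated assumptions, not the paper's implementation: the probe form (logistic regression), the synthetic 16-dimensional "representations", and the names `train_linear_probe` and `safety_score` are all hypothetical. It shows how a harmful representation that is shifted toward the safe cluster receives a higher safety score, which is the kind of deception the abstract describes.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent.

    X: (n, d) array of representations; y: (n,) labels, 1 = safe, 0 = harmful.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(safe)
        w -= lr * (X.T @ (p - y) / n)           # gradient of mean log-loss
        b -= lr * np.mean(p - y)
    return w, b

def safety_score(h, w, b):
    """Probability that representation(s) h fall on the safe side of the probe."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

# Synthetic stand-ins for hidden states (assumption: two clusters that are
# separated along a single direction; real activations would come from an LLM).
rng = np.random.default_rng(0)
d = 16
direction = np.zeros(d)
direction[0] = 1.0
safe = rng.normal(size=(200, d)) + 2.0 * direction
harmful = rng.normal(size=(200, d)) - 2.0 * direction

X = np.vstack([safe, harmful])
y = np.concatenate([np.ones(200), np.zeros(200)])
w, b = train_linear_probe(X, y)
accuracy = np.mean((safety_score(X, w, b) > 0.5) == (y == 1))

# A hypothetical "jailbreak" representation: a harmful point pushed toward
# the safe cluster along the separating direction.
harmful_mean = harmful.mean(axis=0)
jailbreak = harmful_mean + 3.0 * direction
score_harmful = safety_score(harmful_mean, w, b)
score_jailbreak = safety_score(jailbreak, w, b)
```

In this toy setup `score_jailbreak` exceeds `score_harmful`, because the shift moves the point across the probe's decision boundary. With real models the probe would be trained per layer on activations from harmful and safe prompts.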

Paper Structure

This paper contains 31 sections, 3 equations, 11 figures, 4 tables, 2 algorithms.

Figures (11)

  • Figure 1: An overview of JailbreakLens, a framework that interprets LLM jailbreak behavior via capturing representations and circuits. JailbreakLens first predicts the safety score of a specific representation and obtains two token sets that reflect affirmation and refusal in the representation space. We then use these token sets to attribute activation differences and identify key circuits. Finally, we combine the safety scores with key circuit activations to derive the final interpretation results.
  • Figure 2: Prediction accuracy of the probes and probing results of different jailbreak methods.
  • Figure 3: Decoding the jailbreak representation of each layer in Llama2-7b in vocabulary space.
  • Figure 4: Average refusal score for each attention head in Llama2-7b and Vicuna1.5-7b when responding to harmful and safe prompts, where L21H14 contributes most to refusal, and L26H04 contributes most to affirmations.
  • Figure 5: (a) Average refusal score attribution for each MLP layer in Llama2-7b on harmful and safe prompts. (b) Average refusal score attribution for each MLP layer on jailbreak prompts, where each color represents a specific jailbreak.
  • ...and 6 more figures
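
The captions for Figures 1, 4, and 5 refer to a refusal score attributed to individual attention heads and MLP layers, based on two token sets that reflect affirmation and refusal. The exact definition is not given in this excerpt. The sketch below assumes one plausible form: project a component's output into vocabulary space and take the mean logit over refusal tokens minus the mean logit over affirmation tokens. The function `refusal_score`, the toy unembedding matrix `W_U`, and the token ids are hypothetical and chosen only for illustration.

```python
import numpy as np

def refusal_score(component_out, W_U, refusal_ids, affirm_ids):
    """Assumed score: mean refusal-token logit minus mean affirmation-token logit.

    component_out: (d,) output that one head or MLP layer writes to the residual stream.
    W_U: (d, vocab) unembedding matrix that maps the residual stream to logits.
    """
    logits = component_out @ W_U
    return logits[refusal_ids].mean() - logits[affirm_ids].mean()

# Toy setup (assumption): 4-dimensional residual stream, 6-token vocabulary.
# Tokens 0-1 stand in for refusal words, tokens 2-3 for affirmation words.
W_U = np.zeros((4, 6))
W_U[0, [0, 1]] = 1.0  # dimension 0 boosts the refusal tokens
W_U[1, [2, 3]] = 1.0  # dimension 1 boosts the affirmation tokens
refusal_ids = np.array([0, 1])
affirm_ids = np.array([2, 3])

# Two hypothetical components: one writes along the refusal dimension,
# the other along the affirmation dimension.
refusal_head = np.array([2.0, 0.0, 0.0, 0.0])
affirm_head = np.array([0.0, 3.0, 0.0, 0.0])

score_refusal_head = refusal_score(refusal_head, W_U, refusal_ids, affirm_ids)
score_affirm_head = refusal_score(affirm_head, W_U, refusal_ids, affirm_ids)
```

Under this assumed definition a positive score marks a component that pushes toward refusal and a negative score marks one that pushes toward affirmation. Comparing such scores between harmful and jailbreak prompts is one way to read the per-head and per-layer plots.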

Theorems & Definitions (10)

  • remark 1
  • remark 2
  • remark 3
  • remark 4
  • remark 5
  • remark 6
  • remark 7
  • remark 8
  • remark 9
  • remark 10