
Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

Zhihua Wei, Qiang Li, Jian Ruan, Zhenxin Qin, Leilei Wen, Dongrui Liu, Wen Shen

Abstract

Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method that enhances VLM safety by removing the jailbreak-related shift (JRS-Rem) at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.
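The projection and removal steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the jailbreak direction is identified, so the difference-of-class-means estimate below is an assumption, as are all function names and the use of a single hidden-state vector per input.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]

def mean(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def jailbreak_direction(jailbreak_reps, refusal_reps):
    # ASSUMPTION: the direction is estimated as the normalized difference
    # between the mean jailbreak and mean refusal representations.
    mu_j = mean(jailbreak_reps)
    mu_r = mean(refusal_reps)
    return unit([a - b for a, b in zip(mu_j, mu_r)])

def jailbreak_related_shift(h_with_image, h_text_only, direction):
    # Image-induced shift: representation with the image minus the
    # representation of the same prompt without it.
    shift = [a - b for a, b in zip(h_with_image, h_text_only)]
    # Jailbreak-related shift: scalar component along the unit direction.
    return dot(shift, direction)

def remove_jailbreak_related_shift(h_with_image, h_text_only, direction):
    # Subtract only the component along the jailbreak direction and
    # leave the rest of the image-induced shift untouched.
    s = jailbreak_related_shift(h_with_image, h_text_only, direction)
    return [h - s * d for h, d in zip(h_with_image, direction)]
```

In this sketch the rest of the image-induced shift survives the removal step, which is consistent with the abstract's claim that benign-task performance is preserved, although the paper's actual procedure may differ in which layers and tokens it touches.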

Paper Structure (38 sections, 3 equations, 15 figures, 9 tables)


Figures (15)

  • Figure 1: Illustration of the VLM jailbreak mechanism. Jailbreak samples form a distinct state (red circle) in the representation space, separable from benign (blue circle) and refusal (green circle) states. The image-induced representation shift contains a jailbreak-related shift component along the jailbreak direction, which steers the VLM's representation into the jailbreak state.
  • Figure 2: PCA visualization of the representation space. Jailbreak samples form a distinct cluster, clearly separated from both benign samples and refusal samples. Additional results are provided in Appendix \ref{app:res-sec3}.
  • Figure 3: Representation analysis on LLaVA-1.5-7B. (a) Average cosine distance to the jailbreak centroid. Shaded areas denote standard deviation. Benign and refusal samples remain distant from the jailbreak centroid. (b) Linear probing F1 scores. High F1 scores confirm that the three categories are linearly separable. Additional results are provided in Appendix \ref{app:res-sec3}.
  • Figure 4: Average normalized jailbreak-related shift on LLaVA-1.5-7B across different scenarios: (a) explicitly harmful, (b) implicitly harmful, (c) adversarial attack, and (d) benign. Jailbreak samples consistently exhibit larger shift than refusal samples, while benign samples remain near zero. Additional results are provided in Appendix \ref{app:res-sec4}.
  • Figure 5: (a) Image examples with increasing levels of harmful visual information. (b) Average normalized jailbreak-related shift across layers, showing a progressive increase as more harmful information is introduced. (c) Relationship between the jailbreak-related shift (layer 19) and ASR. As the amount of harmful visual information increases, the ASR and the jailbreak-related shift increase concurrently.
  • ...and 10 more figures
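
The "average cosine distance to the jailbreak centroid" reported in Figure 3(a) can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: it assumes the centroid is the plain mean of the jailbreak representations at one layer and that cosine distance means one minus cosine similarity. All function names are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_distance(a, b):
    # ASSUMPTION: cosine distance is 1 - cosine similarity.
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot(a, b) / (norm_a * norm_b)

def mean_distance_to_centroid(samples, reference_centroid):
    """Average cosine distance from each sample to a fixed centroid."""
    distances = [cosine_distance(s, reference_centroid) for s in samples]
    return sum(distances) / len(distances)
```

Computing this quantity per layer for benign, refusal, and jailbreak samples would give curves of the kind the caption describes, with the first two groups staying far from the jailbreak centroid.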