Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs

Yuxuan Zhou, Yuzhao Peng, Yang Bai, Kuofeng Gao, Yihao Zhang, Yechao Zhang, Xun Chen, Tao Yu, Tao Dai, Shu-Tao Xia

TL;DR

The paper analyzes why weak-OOD perturbations can unexpectedly enhance jailbreaking of vision-language models, revealing an asymmetry between pre-training and safety alignment. By formalizing dual constraints on input-intent perception and refusal triggering, it shows that mild OOD shifts can preserve malicious intent while suppressing refusals, whereas larger shifts disrupt intent. It introduces JOCR, an OCR-inspired jailbreak that embeds malicious text into images and applies controlled visual perturbations, achieving superior attack success rates across multiple models and benchmarks. The findings deepen understanding of OOD-based vulnerabilities and suggest practical directions to strengthen VLM safety alignment and resilience against such attacks.

Abstract

Large Vision-Language Models (VLMs) are susceptible to jailbreak attacks: researchers have developed a variety of attack strategies that bypass their safety mechanisms. Among these, jailbreak methods based on the Out-of-Distribution (OOD) strategy have attracted widespread attention for their simplicity and effectiveness. This paper deepens the understanding of OOD-based VLM jailbreak methods. Experimental results show that jailbreak samples generated via mild OOD strategies perform better at circumventing the safety constraints of VLMs, a phenomenon we call "weak-OOD". To uncover its causes, we take SI-Attack, a typical OOD-based jailbreak method, as the object of study. We attribute the phenomenon to a trade-off between two dominant factors, input intent perception and model refusal triggering, which respond inconsistently to OOD manipulations. We further give a theoretical argument that this inconsistency is inevitable, arising from discrepancies between the model pre-training and alignment processes. Building on these insights, we draw inspiration from optical character recognition (OCR) capability enhancement, a core task in the pre-training phase of mainstream VLMs, and design a simple yet highly effective VLM jailbreak method that outperforms SOTA baselines.

Paper Structure

This paper contains 41 sections, 17 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Jailbreak Toxic Score and Attack Success Rate (ASR) against GPT-4o (first row), GPT-4.1 (middle row) and Doubao-1.6 (bottom row). We plot the attack results of the three attacks under different degrees of OOD perturbation. More detailed results can be found in the Appendix.
  • Figure 2: PCA Feature Visualization of Layers 17, 19, and 21. We plot the feature distribution of harmful QA samples and shuffle-class samples under different model layers.
  • Figure 3: Layer-wise variations of input-intent-perception and model-refusal-triggering. We plot the variations of these two metrics across model layers under different degrees of OOD.
  • Figure 4: Comparison of the rate of change of input-intent-perception and model-refusal-triggering vs. shuffle number.
  • Figure 5: PCA Feature Visualization of Layers 18, 20, 23, 24, 25, 26. We plot the feature distribution of harmful QA samples and shuffle-class samples under different model layers.
  • ...and 1 more figure