Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models

Quanchen Zou, Moyang Chen, Zonghao Ying, Wenzhuo Xu, Yisong Xiao, Deyue Zhang, Dongdong Yang, Zhao Liu, Xiangzheng Zhang

TL;DR

This paper identifies a systemic flaw in which Large Vision-Language Models (LVLMs) can be induced to synthesize harmful logic from benign premises, and formalizes this attack paradigm as Reasoning-Oriented Programming, drawing a structural analogy to Return-Oriented Programming (ROP) in systems security.

Abstract

Large Vision-Language Models (LVLMs) undergo safety alignment to suppress harmful content. However, current defenses predominantly target explicit malicious patterns in the input representation, often overlooking the vulnerabilities inherent in compositional reasoning. In this paper, we identify a systemic flaw where LVLMs can be induced to synthesize harmful logic from benign premises. We formalize this attack paradigm as Reasoning-Oriented Programming, drawing a structural analogy to Return-Oriented Programming (ROP) in systems security. Just as ROP circumvents memory protections by chaining benign instruction sequences, our approach exploits the model's instruction-following capability to orchestrate a semantic collision of orthogonal benign inputs. We instantiate this paradigm via VROP, an automated framework that optimizes for semantic orthogonality and spatial isolation. By generating visual gadgets that are semantically decoupled from the harmful intent and arranging them to prevent premature feature fusion, VROP forces the malicious logic to emerge only during the late-stage reasoning process. This effectively bypasses perception-level alignment. We evaluate VROP on SafeBench and MM-SafetyBench across 7 state-of-the-art LVLMs, including GPT-4o and Claude 3.7 Sonnet. Our results demonstrate that VROP consistently circumvents safety alignment, outperforming the strongest existing baseline by an average of 4.67% on open-source models and 9.50% on commercial models.

Paper Structure (47 sections, 7 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Analogy between ROP in software and VROP in LVLMs. Code gadgets with control flow in ROP correspond to visual gadgets and prompt-driven reasoning in VROP.
  • Figure 2: Overview of VROP. VROP first constructs task-specific visual gadgets through Semantic Gadget Mining, and then synthesizes an optimized control prompt via Gradient-Free Optimization to steer the LVLM’s reasoning process. The optimized prompt orchestrates semantic composition over the visual gadgets, enabling the model to follow the attacker’s intent and ultimately achieve a successful jailbreak.
  • Figure 3: ASR of VROP and baseline methods on commercial LVLMs.
  • Figure 4: ASR of VROP attack under defense mechanisms.
  • Figure 5: Ablation study on auxiliary models.
  • ...and 5 more figures