Robustness of Vision Language Models Against Split-Image Harmful Input Attacks

Md Rafi Ur Rashid; MD Sadik Hossain Shanto; Vishnu Asutosh Dasu; Shagufta Mehnaz

TL;DR

This work identifies a new vulnerability in vision-language model (VLM) safety: safety alignment learned on holistic images does not generalize to split-image inputs, where harmful cues are distributed across fragments. The authors formalize a constrained multimodal optimization framework, propose SIVA with three progressive phases, and develop Adv-KD to improve cross-model transferability in fully black-box settings. In evaluations on three modern VLMs and three jailbreak datasets, Adaptive-SIVA and Transfer-SIVA achieve substantial gains over baselines, with Transfer-SIVA yielding up to 60% higher transfer success. To mitigate the risk, they propose aDPO, an extension of the Direct Preference Optimization framework that efficiently incorporates split-image instances into safety alignment. Overall, the paper highlights the need to account for multi-image jailbreaks in safety training to build more robust and trustworthy VLMs.
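
For context on the mitigation, the standard Direct Preference Optimization objective that aDPO builds on can be written as below. This is the widely used baseline formulation, not the paper's aDPO loss, which this summary does not specify. The symbols are assumptions of this sketch: $x$ is the input (here it could be an image set plus prompt), $y_w$ and $y_l$ are the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ is a temperature-like scaling factor.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

In a safety-alignment setting, $y_w$ would plausibly be a refusal and $y_l$ a harmful completion, with split-image instances added to the preference dataset $\mathcal{D}$.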

Abstract

Vision-Language Models (VLMs) are now a core part of modern AI. Recent work has proposed several visual jailbreak attacks that use single, holistic images. However, contemporary VLMs demonstrate strong robustness against such attacks due to extensive safety alignment through preference optimization (e.g., RLHF). In this work, we identify a new vulnerability: while VLM pretraining and instruction tuning generalize well to split-image inputs, safety alignment is typically performed only on holistic images and does not account for harmful semantics distributed across multiple image fragments. Consequently, VLMs often fail to detect and refuse harmful split-image inputs, where unsafe cues emerge only after the images are combined. We introduce novel split-image visual jailbreak attacks (SIVA) that exploit this misalignment. Unlike prior optimization-based attacks, which exhibit poor black-box transferability due to architectural and prior mismatches across models, our attacks evolve in progressive phases from naive splitting to an adaptive white-box attack, culminating in a black-box transfer attack. Our strongest strategy leverages a novel adversarial knowledge distillation (Adv-KD) algorithm to substantially improve cross-model transferability. Evaluations on three state-of-the-art VLMs and three jailbreak datasets demonstrate that our strongest attack achieves up to 60% higher transfer success than existing baselines. Lastly, we propose efficient ways to address this critical vulnerability in current VLM safety alignment.
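
To make the notion of a split-image input concrete, the sketch below cuts a single image array into fragments by a width ratio such as 1:3 or 1:1:1, the ratios named in the Figure 1 caption. It is an illustration for building robustness test inputs, not the authors' implementation: the function name `split_by_ratio`, the choice to cut along the width axis, and the rounding of cut points are all assumptions, and the demo uses a synthetic array.

```python
import numpy as np

def split_by_ratio(image: np.ndarray, ratio: tuple[int, ...]) -> list[np.ndarray]:
    """Split an H x W (x C) array into side-by-side strips whose widths follow `ratio`.

    Hypothetical helper: the split axis (width) and rounding rule are assumptions.
    """
    if not ratio or any(r <= 0 for r in ratio):
        raise ValueError("ratio must contain positive integers")
    width = image.shape[1]
    total = sum(ratio)
    # Cumulative cut points, rounded to whole pixel columns.
    cuts = [round(width * sum(ratio[:i]) / total) for i in range(1, len(ratio))]
    return np.split(image, cuts, axis=1)

# Synthetic 4 x 8 RGB-like array standing in for an image.
image = np.arange(4 * 8 * 3).reshape(4, 8, 3)
parts_13 = split_by_ratio(image, (1, 3))      # 1:3 split
parts_111 = split_by_ratio(image, (1, 1, 1))  # 1:1:1 split
```

Concatenating the fragments along the same axis recovers the original array, which is a convenient sanity check that a split set carries exactly the content of the holistic image.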

Paper Structure (41 sections, 26 equations, 19 figures, 4 tables)

This paper contains 41 sections, 26 equations, 19 figures, 4 tables.

Introduction
Formalization of VLM Jailbreak Attacks
Visual Objective $\mathcal{L}_v$:
Textual Objective $\mathcal{L}_t$:
Literature Review:
VLMs' Safety Vulnerability Against Split-Image Inputs
Experiment Setup
Datasets.
Models.
Evaluation Metrics.
Methodology and Evaluation
Phase 1: Naïve SIVA
Attack Strategy
Results
Limitations
...and 26 more sections

Figures (19)

Figure 1: Observations: (a) Single benign image for a benign task, (b) Split benign images for a benign task, (c) Single harmful image for jailbreak, (d) 1:3--split harmful image for jailbreak, (e) 1:1:1--split harmful image for jailbreak.
Figure 2: A split-image jailbreak instance passing through a VLM. $\times$ indicates that features are restricted from attending to each other, and ✓ indicates that features attend to each other.
Figure 3: Instance of (a) jailbreakV (b) Forbidden Recipe and (c) Reddit-NSFW (blurred) dataset samples.
Figure 4: Instance of complementary split-ratio, 1:3 and 3:1 in (a) jailbreakV, (b) FR, and (c) NSFW (blurred) dataset. (d) Variation in ASR for these two complementary split ratios.
Figure 5: (a) The VT module does not merge distinct images, but merges split images into a single input. (b) ASR with and without the visual filtering. (c) Accuracy of the VT module in distinguishing between distinct and split images.
...and 14 more figures

Theorems & Definitions (1)

Definition 1: Jailbreak Method on VLM
