Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models

Ruofan Wang, Xin Wang, Yang Yao, Xuan Tong, Xingjun Ma

TL;DR

This work reveals a grey-box vulnerability in fine-tuned Vision-Language Models (VLMs): base models harbor transferable jailbreaks that persist across downstream variants. It introduces the Simulated Ensemble Attack (SEA), which combines two techniques to achieve high transferability: Fine-tuning Trajectory Simulation (FTS), which perturbs the vision encoder to simulate fine-tuning, and Targeted Prompt Guidance (TPG), which stabilizes optimization and steers outputs. Experiments on Qwen2-VL-2B/7B show that SEA attains an attack success rate (ASR) above 86.54% with substantial toxicity increases on RealToxicityPrompts, even for safety-tuned variants, indicating that vulnerabilities are inherited across the model lifecycle. The results underscore the need for inheritance-aware defenses that secure both base and downstream models, rather than relying on downstream safety tuning alone, to counter transferable vulnerabilities in multimodal AI systems.

Abstract

Fine-tuning open-source Vision-Language Models (VLMs) creates a critical yet underexplored attack surface: vulnerabilities in the base VLM could be retained in fine-tuned variants, rendering them susceptible to transferable jailbreak attacks. To demonstrate this risk, we introduce the Simulated Ensemble Attack (SEA), a novel grey-box jailbreak method in which the adversary has full access to the base VLM but no knowledge of the fine-tuned target's weights or training configuration. To improve jailbreak transferability across fine-tuned VLMs, SEA combines two key techniques: Fine-tuning Trajectory Simulation (FTS) and Targeted Prompt Guidance (TPG). FTS generates transferable adversarial images by simulating the vision encoder's parameter shifts, while TPG is a textual strategy that steers the language decoder toward adversarially optimized outputs. Experiments on the Qwen2-VL family (2B and 7B) demonstrate that SEA achieves high transfer attack success rates exceeding 86.5% and toxicity rates near 49.5% across diverse fine-tuned variants, even those specifically fine-tuned to improve safety behaviors. Notably, while direct PGD-based image jailbreaks rarely transfer across fine-tuned VLMs, SEA reliably exploits inherited vulnerabilities from the base model, significantly enhancing transferability. These findings highlight an urgent need to safeguard fine-tuned proprietary VLMs against transferable vulnerabilities inherited from open-source foundations, motivating the development of holistic defenses across the entire model lifecycle.

Paper Structure

This paper contains 30 sections, 8 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Motivation of our work. Adversarial images that successfully jailbreak a base VLM often fail once the same model is fully fine-tuned, revealing a key challenge in achieving transferable jailbreak attacks.
  • Figure 2: Overview of the SEA framework. (1) The attack starts with a clean image and a harmful query enhanced with our Targeted Prompt Guidance (TPG). (2) SEA then attacks the public base VLM in a grey-box setting, crafting a robust adversarial image by simulating fine-tuning trajectories via vision encoder perturbations and using TPG to steer the text decoder. (3) The resulting image effectively transfers to diverse, privately fine-tuned VLMs, achieving a high attack success rate (ASR) without any further adaptation.
  • Figure 3: Comparison of responses from Qwen2-VL-7B fully fine-tuned on the MIS dataset. While standard PGD-based adversarial images are detected as malicious and refused, our SEA attack reliably elicits harmful instructions.
  • Figure 4: Distribution of parameter changes $\Delta\theta$ between fine-tuned and base models. The left and right panels show the results for models fine-tuned on the OmniAlign-V dataset and the MIS dataset, respectively. The empirical distributions of $\Delta\theta$ (histograms) are compared against the injected Gaussian noise used in FTS.
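
The abstract and the Figure 4 caption describe FTS as simulating fine-tuned variants by injecting Gaussian noise into the vision encoder's parameters while optimizing the adversarial image. The following is a minimal toy sketch of that general idea, not the authors' implementation. It assumes a linear "encoder" `W`, a squared-distance loss toward a placeholder `target` embedding, and illustrative hyperparameters (`eps`, `step`, `n_steps`, `n_sim`, `sigma`); the function name `simulated_ensemble_pgd` and all of these values are hypothetical and do not come from the paper.

```python
import numpy as np


def simulated_ensemble_pgd(x_clean, W, target, eps=8 / 255, step=1 / 255,
                           n_steps=50, n_sim=4, sigma=0.01, seed=0):
    """Toy PGD that averages gradients over noise-perturbed encoder copies.

    Assumptions (not from the paper): the encoder is the linear map W, and
    the loss is ||W_sim @ x - target||^2 for each simulated variant W_sim.
    """
    rng = np.random.default_rng(seed)
    x_adv = x_clean.copy()
    for _step_idx in range(n_steps):
        grad = np.zeros_like(x_adv)
        for _sim_idx in range(n_sim):
            # Simulate one "fine-tuned" variant with Gaussian weight noise.
            W_sim = W + sigma * rng.standard_normal(W.shape)
            residual = W_sim @ x_adv - target
            # Analytic gradient of the squared distance w.r.t. the image.
            grad += 2.0 * W_sim.T @ residual
        # Signed descent step on the ensemble-averaged gradient.
        x_adv = x_adv - step * np.sign(grad / n_sim)
        # Project back into the L-infinity ball and the valid pixel range.
        x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```

In this sketch, fresh noise is sampled at every step, so the perturbation cannot rely on one exact set of weights. A real VLM would replace the analytic gradient with backpropagation through the perturbed vision encoder and language decoder.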