Improving Adversarial Transferability in MLLMs via Dynamic Vision-Language Alignment Attack

Chenhe Gu, Jindong Gu, Andong Hua, Yao Qin

TL;DR

Multimodal Large Language Models (MLLMs) are vulnerable to adversarial inputs, yet adversarial examples transfer poorly across models, especially when the attack targets a specific output. The authors propose Dynamic Vision-Language Alignment (DynVLA), a method that dynamically perturbs the attention in the vision-language connector using a Gaussian kernel to diversify vision-language modality alignments and improve cross-model transferability. DynVLA operates within a PGD framework, perturbing attention around randomly selected image tokens with a kernel size $m \in \{3,5\}$ and perturbation budget $\epsilon = 16/255$. It is effective across open-source models (BLIP2, InstructBLIP, MiniGPT4, LLaVA) and also transfers to closed-source models such as Gemini. Ablation studies show that kernel size and perturbation budget both influence transferability, suggesting that diversifying the alignment, rather than augmenting only at the pixel level, yields adversarial examples that transfer better across models. The work underscores robustness challenges for real-world deployment and points to defense strategies and future research directions in securing multimodal AI systems.
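The attention perturbation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `strength` and `sigma` parameters, the renormalization step, and the choice to perturb a single query's attention over a 2D grid of image tokens are all assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(m, sigma=1.0):
    """Return an m x m Gaussian kernel (m odd) whose peak value is 1."""
    half = m // 2
    return [
        [math.exp(-(dy * dy + dx * dx) / (2.0 * sigma * sigma))
         for dx in range(-half, half + 1)]
        for dy in range(-half, half + 1)
    ]


def perturb_attention(attn, m=3, strength=1.0, rng=random):
    """Add a Gaussian bump around a random image token, then renormalize.

    attn: H x W list of lists, one query's attention over image tokens.
    strength and the renormalization are assumptions, not from the paper.
    Returns the perturbed map and the (row, col) of the chosen token.
    """
    height, width = len(attn), len(attn[0])
    cy, cx = rng.randrange(height), rng.randrange(width)
    kernel = gaussian_kernel(m)
    half = m // 2
    out = [row[:] for row in attn]
    for ky in range(m):
        for kx in range(m):
            y, x = cy + ky - half, cx + kx - half
            if 0 <= y < height and 0 <= x < width:  # skip cells outside the grid
                out[y][x] += strength * kernel[ky][kx]
    total = sum(sum(row) for row in out)
    return [[v / total for v in row] for row in out], (cy, cx)
```

Drawing a new token position on every attack iteration, with `m` set to 3 or 5, is one way to obtain the varying attention shift that the summary describes.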

Abstract

Multimodal Large Language Models (MLLMs), built upon LLMs, have recently gained attention for their capabilities in image recognition and understanding. However, while MLLMs are vulnerable to adversarial attacks, the transferability of these attacks across different models remains limited, especially in the targeted attack setting. Existing methods primarily focus on vision-specific perturbations but struggle with the complex nature of vision-language modality alignment. In this work, we introduce the Dynamic Vision-Language Alignment (DynVLA) Attack, a novel approach that injects dynamic perturbations into the vision-language connector to enhance generalization across the diverse vision-language alignments of different models. Our experimental results show that DynVLA significantly improves the transferability of adversarial examples across various MLLMs, including BLIP2, InstructBLIP, MiniGPT4, LLaVA, and closed-source models such as Gemini.

Paper Structure

This paper contains 18 sections, 3 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of the framework of our proposed DynVLA attack. DynVLA modifies the attention mechanism in the vision-language connector during the forward pass, forcing the model to focus on different parts of the image. Specifically, DynVLA adds a Gaussian kernel to the attention map to create a smooth attention shift. With the perturbed attention map, the generated adversarial attacks dynamically cover diverse vision-language modality alignments, significantly enhancing the transferability of DynVLA in attacking MLLMs.
  • Figure 2: DynVLA can outperform all other existing transfer attack methods. The left figure uses the InstructBLIP FlanT5xl version as the surrogate model, and the right figure uses the InstructBLIP Vicuna7B version as the surrogate model. The results show the ASR ($\%$) on the other seven target models. Some existing input-transformation-based transfer attacks can also improve the ASR. However, these pixel-level augmentations are limited, while our method can augment the alignment of the vision-language modality.
  • Figure 3: Our method DynVLA is effective on different target outputs. In addition to the word "unknown", DynVLA can also significantly improve the ASR with target sentences such as "I don't know" and "I am sorry", as well as a common object "cat". Specifically, for target output "cat", our method achieves more than 80% ASR across all target models.
  • Figure 4: Successful adversarial examples on Gemini.
  • Figure 5: Ablation study of noise size, noise strength, and perturbation bound. The left two sub-figures show the ASR ($\%$) under different noise sizes and strengths, and the right sub-figure shows the ASR ($\%$) of our method and the baseline under various perturbation bounds.
  • ...and 4 more figures
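
The perturbation bound varied in Figure 5 is the $\epsilon$ of the PGD framework mentioned in the TL;DR ($\epsilon = 16/255$ there). As a rough illustration of how such a bound constrains an attack, here is a generic targeted $L_\infty$ PGD update. This is the standard textbook step, not code from the paper. The step size `alpha` is a hypothetical value, and images are represented as flat lists of pixel values in [0, 1] for simplicity.

```python
def pgd_step(x_adv, x_clean, grad, alpha=1 / 255, eps=16 / 255):
    """One targeted L-infinity PGD update.

    grad is the gradient of the targeted loss with respect to x_adv.
    A targeted attack minimizes that loss, so the step descends.
    alpha is an assumed step size; eps is the perturbation bound.
    """
    out = []
    for xa, xc, g in zip(x_adv, x_clean, grad):
        sign = (g > 0) - (g < 0)                         # sign of the gradient
        stepped = xa - alpha * sign                      # move toward the target output
        stepped = min(max(stepped, xc - eps), xc + eps)  # project into the eps-ball
        out.append(min(max(stepped, 0.0), 1.0))          # keep a valid pixel value
    return out
```

Whatever the step size, the projection keeps every pixel within `eps` of the clean image, which is why the bound is a natural axis for the ablation.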