Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens

Xikang Yang; Xuehai Tang; Fuqing Zhu; Jizhong Han; Songlin Hu

TL;DR

Cross-prompt transfer in vision-language models is hampered because the token distribution induced by an adversarial image stays biased toward the original image semantics. The authors propose the Contextual Injection Attack (CIA), a gradient-based perturbation framework that injects target tokens into both the visual and the textual context to shift the token distribution toward the target text. For a VLM $M_{VL}$, the adversarial image is $P(I_{ori})=I_{ori}+\boldsymbol{\delta}_v$, and the perturbation is optimized with PGD on the loss $L_{total}=\alpha(\beta L_v+(1-\beta)L_t)+(1-\alpha)L_o$ subject to $\|\boldsymbol{\delta}_v\|_p\le \epsilon_v$. The three losses $L_v$, $L_t$, and $L_o$ jointly exploit the visual and textual channels to maximize the probability of the target text $T_{tgt}$ across prompts. CIA achieves higher Attack Success Rates (ASR) than Single-P, Multi-P, and CroPA on the CLS, CAP, and VQA tasks. Experiments on BLIP2, InstructBLIP, and LLaVA show that CIA improves cross-prompt transferability, which both exposes a potent adversarial tactic and informs defensive strategies for multimodal systems.
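To make the weighted objective and the constrained update concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that $L_v$, $L_t$, and $L_o$ are scalar losses to be minimized (for example, cross-entropies against the target tokens), that the norm is $p=\infty$, and that the gradient of $L_{total}$ with respect to $\boldsymbol{\delta}_v$ is supplied by the caller. The names `total_loss` and `pgd_step` and the step size are hypothetical.

```python
def total_loss(l_v, l_t, l_o, alpha, beta):
    """L_total = alpha * (beta * L_v + (1 - beta) * L_t) + (1 - alpha) * L_o."""
    return alpha * (beta * l_v + (1.0 - beta) * l_t) + (1.0 - alpha) * l_o


def pgd_step(delta, grad, step_size, eps):
    """One PGD update on a flat perturbation (assumes an L_inf constraint).

    Takes a signed-gradient step that decreases the loss, then projects
    each coordinate back into [-eps, eps].
    """
    def sign(g):
        return (g > 0) - (g < 0)

    stepped = [d - step_size * sign(g) for d, g in zip(delta, grad)]
    return [max(-eps, min(eps, d)) for d in stepped]


# Toy values only; a real attack would obtain the losses and the gradient from the VLM.
loss = total_loss(l_v=2.0, l_t=1.0, l_o=4.0, alpha=0.5, beta=0.5)
delta = pgd_step([0.0, 0.03, -0.03], [1.0, -2.0, 0.5], step_size=0.01, eps=0.03)
```

With $\alpha=1$ the objective reduces to the two context terms, and with $\alpha=0$ it reduces to $L_o$ alone; $\beta$ trades off $L_v$ against $L_t$ inside the first term. A full attack would also clip $I_{ori}+\boldsymbol{\delta}_v$ to the valid pixel range, which this sketch omits.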

Abstract

Vision-language models (VLMs) seamlessly integrate visual and textual data to perform tasks such as image classification, caption generation, and visual question answering. However, in cross-prompt transfer attacks, an adversarial image often fails to deceive all prompts, because the token probability distribution it induces tends to favor the semantics of the original image rather than the target tokens. To address this challenge, we propose a Contextual-Injection Attack (CIA) that employs gradient-based perturbation to inject target tokens into both visual and textual contexts, thereby increasing the probability of the target tokens. By shifting the contextual semantics towards the target tokens instead of the original image semantics, CIA enhances the cross-prompt transferability of adversarial images. Extensive experiments on the BLIP2, InstructBLIP, and LLaVA models show that CIA outperforms existing methods in cross-prompt transferability, demonstrating its potential for more effective adversarial strategies in VLMs.

Paper Structure (26 sections, 7 equations, 4 figures, 11 tables, 1 algorithm)

Introduction
Related works
Preliminary Analysis
Injecting misleading target tokens into visual context
Injecting misleading target tokens into textual context
Methodology
Overall Structure
Problem definition
Contextual Injection Attack (CIA)
Experiments
Datasets & Experimental settings
Evaluation metrics
Transferability comparison
Case study
CIA with different perturbation size
...and 11 more sections

Figures (4)

Figure 1: Vulnerability of cross-prompt transfer attacks: adversarial images favor the original image semantics over the target tokens.
Figure 2: Overall Structure of the CIA Framework: By injecting the target token into the positions of both visual and text tokens, the probability of the target token appearing in the visual and textual context is increased.
Figure 3: Cross-entropy (CE) values of the logits with respect to the target task at different token positions: visual token positions, input text token positions, and generated text token positions. The horizontal axis represents the token positions (for example, in BLIP2, from left to right, the first 32 tokens are visual tokens, followed by user input tokens, and finally the generated tokens). The scatter plot shows the CE value at each token position, while the horizontal lines indicate the average CE value for each of the three sections.
Figure 4: Impact of the loss weights: a heat map of ASR for varying values of $\alpha$ and $\beta$.
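The per-section averages described in the Figure 3 caption can be sketched as follows. This is an illustrative helper, not code from the paper: the function name `segment_means`, its arguments, and the toy CE values are assumptions. The only detail taken from the caption is the ordering of positions (visual tokens first, then input text tokens, then generated tokens; BLIP2 uses 32 visual tokens).

```python
def segment_means(ce_values, num_visual, num_input):
    """Average per-position CE values over the three token sections.

    ce_values is ordered as: visual tokens, input text tokens, generated tokens.
    An empty section yields NaN rather than raising.
    """
    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    split = num_visual + num_input
    return {
        "visual": mean(ce_values[:num_visual]),
        "input_text": mean(ce_values[num_visual:split]),
        "generated": mean(ce_values[split:]),
    }


# Toy example with 2 visual, 2 input, and 2 generated positions.
means = segment_means([4.0, 2.0, 3.0, 5.0, 1.0, 1.0], num_visual=2, num_input=2)
```

For BLIP2, `num_visual` would be 32 according to the caption; the number of input positions depends on the prompt.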
