InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models

Xunguang Wang, Zhenlan Ji, Pingchuan Ma, Zongjie Li, Shuai Wang

TL;DR

This work tackles targeted adversarial attacks on LVLMs under a gray-box setting where the attacker knows only the vision encoder. It proposes InstructTA, which uses a target text $y_t$ to generate a target image $x_t$ via a text-to-image model and infers an instruction $p'$ with GPT-4, then optimizes perturbations against a surrogate model to align with the target in an instruction-aware feature space, augmented by paraphrase-based instructions. A dual objective incorporating MF-it improves transferability across diverse LVLM backends; PGD with a dynamic instruction set yields strong targeted attack performance. Experiments across five LVLMs show that InstructTA outperforms baselines in CLIP-score and attack success rate, including under cross-instruction transfer scenarios. The work highlights security concerns for LVLMs and suggests mitigation strategies like adversarial training and detection based on instruction-aware feature discrepancies.

Abstract

Large vision-language models (LVLMs) have demonstrated strong capabilities in image understanding and response generation. However, this rich visual interaction also makes LVLMs vulnerable to adversarial examples. In this paper, we formulate a novel and practical targeted attack scenario in which the adversary knows only the vision encoder of the victim LVLM, with no knowledge of its prompts (which are often proprietary to service providers and not publicly available) or its underlying large language model (LLM). This practical setting poses challenges to the cross-prompt and cross-model transferability of targeted adversarial attacks, which aim to make the LVLM output a response that is semantically similar to the attacker's chosen target text. To this end, we propose an instruction-tuned targeted attack (dubbed InstructTA) that delivers targeted adversarial attacks on LVLMs with high transferability. First, we use a public text-to-image generative model to "reverse" the target response into a target image, and employ GPT-4 to infer a reasonable instruction $\boldsymbol{p}^\prime$ from the target response. We then form a local surrogate model (sharing the same vision encoder as the victim LVLM) to extract instruction-aware features of an adversarial image example and the target image, and optimize the adversarial example by minimizing the distance between these two features. To further improve transferability through instruction tuning, we augment the instruction $\boldsymbol{p}^\prime$ with paraphrases generated by GPT-4. Extensive experiments demonstrate the superiority of our proposed method in targeted attack performance and transferability. The code is available at https://github.com/xunguangwang/InstructTA.
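
The optimization loop described above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the surrogate's instruction-aware feature extractor is replaced by a hypothetical linear map per paraphrased instruction, and the $L_\infty$ projection, the step size, and all function names (`features`, `grad_l2_sq`, `pgd_feature_match`) are assumptions made for the example. For simplicity, the same sampled instruction is applied to both the adversarial image and the target image, and the squared $L_2$ distance is used so that the gradient has a closed form.

```python
import random


def features(weights, image):
    """Toy stand-in for instruction-aware features: a linear map W @ image."""
    return [sum(w * v for w, v in zip(row, image)) for row in weights]


def l2_sq(a, b):
    """Squared L2 distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def grad_l2_sq(weights, adv, target):
    """Gradient of ||W adv - W target||^2 w.r.t. adv: 2 W^T (W adv - W target)."""
    diff = [fa - ft for fa, ft in
            zip(features(weights, adv), features(weights, target))]
    return [2.0 * sum(weights[r][c] * diff[r] for r in range(len(weights)))
            for c in range(len(adv))]


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def pgd_feature_match(clean, target, instruction_weights, eps, step, iters, rng):
    """Signed-gradient descent on the feature distance, projected so that the
    result stays within eps of the clean image and inside [0, 1]."""
    adv = list(clean)
    for _ in range(iters):
        # Assumption: draw one paraphrased instruction per iteration.
        weights = rng.choice(instruction_weights)
        grad = grad_l2_sq(weights, adv, target)
        # Move against the gradient to shrink the feature distance.
        adv = [a - step * sign(g) for a, g in zip(adv, grad)]
        # Project back into the eps-ball around the clean image.
        adv = [min(max(a, c - eps), c + eps) for a, c in zip(adv, clean)]
        # Keep pixel values in the valid range.
        adv = [min(max(a, 0.0), 1.0) for a in adv]
    return adv


if __name__ == "__main__":
    rng = random.Random(0)
    n_pixels, n_feats = 6, 4
    clean = [0.5] * n_pixels
    target = [rng.random() for _ in range(n_pixels)]
    # Three hypothetical "instructions", each a random linear feature map.
    instruction_weights = [
        [[rng.uniform(-1.0, 1.0) for _ in range(n_pixels)] for _ in range(n_feats)]
        for _ in range(3)
    ]
    adv = pgd_feature_match(clean, target, instruction_weights,
                            eps=8 / 255, step=1 / 255, iters=50, rng=rng)
    for w in instruction_weights:
        print(l2_sq(features(w, clean), features(w, target)),
              l2_sq(features(w, adv), features(w, target)))
```

In a realistic setting the linear maps would be replaced by the surrogate LVLM's feature extractor and the gradient would come from automatic differentiation; only the update-then-project structure is meant to carry over.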

Paper Structure (15 sections, 7 equations, 5 figures, 6 tables, 1 algorithm)

Figures (5)

  • Figure 1: The framework of our instruction-tuned targeted attack (InstructTA). Given a target text $\boldsymbol{y}_t$, we first transform it into the target image $\boldsymbol{x}_t$ with a text-to-image model $h_\xi$. Simultaneously, GPT-4 infers a reasonable instruction $\boldsymbol{p}^\prime$. Given the augmented instructions $\boldsymbol{p}_{i}^\prime$ and $\boldsymbol{p}_{j}^\prime$, which are rephrased from $\boldsymbol{p}^\prime$ using GPT-4, the surrogate model $M$ extracts instruction-aware features of $\boldsymbol{x}_t$ and the adversarial example (AE) $\boldsymbol{x}^\prime$, respectively. Finally, we minimize the $L_2$ distance between these two features to optimize $\boldsymbol{x}^\prime$.
  • Figure 2: The architecture of LVLM.
  • Figure 3: An example of rephrasing an instruction. Given the target response $\boldsymbol{y}_t$, GPT-4 infers an instruction, i.e., "Can you describe what's in the picture you're looking at related to food?". The real instruction assigned to this target text is "What are the essential components depicted in this image?"
  • Figure 4: Visualization examples of various targeted attack methods on InstructBLIP. "What are the essential components depicted in this image?" is a real instruction.
  • Figure 5: To explore the impact of the perturbation budget $\epsilon$ in InstructTA, we generated AEs $\boldsymbol{x}^\prime$ on BLIP-2 at different perturbation levels. As $\epsilon$ grows, the visual quality of $\boldsymbol{x}^\prime$ degrades, as quantified by the LPIPS distance (Zhang et al., 2018) between the original image $\boldsymbol{x}$ and the adversarial image $\boldsymbol{x}^\prime$, while the effectiveness of targeted response generation saturates. It is therefore crucial to choose an appropriate perturbation budget, such as $\epsilon=8$, to balance image quality and targeted attack performance.
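
Read together, the captions of Figures 1 and 5 suggest a constrained feature-matching objective. The following is a hedged sketch, assuming an $L_\infty$ constraint with budget $\epsilon$ around the original image $\boldsymbol{x}$ (the norm type is not stated in this excerpt) and writing $M(\cdot,\cdot)$ for the surrogate's instruction-aware feature extractor. The full objective in the paper may include additional terms, such as the MF-it component mentioned in the TL;DR.

$$
\min_{\boldsymbol{x}^\prime} \; \left\| M(\boldsymbol{x}^\prime, \boldsymbol{p}_{j}^\prime) - M(\boldsymbol{x}_t, \boldsymbol{p}_{i}^\prime) \right\|_2 \quad \text{s.t.} \quad \left\| \boldsymbol{x}^\prime - \boldsymbol{x} \right\|_\infty \le \epsilon
$$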