InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
Xunguang Wang, Zhenlan Ji, Pingchuan Ma, Zongjie Li, Shuai Wang
TL;DR
This work tackles targeted adversarial attacks on LVLMs under a gray-box setting where the attacker knows only the vision encoder. It proposes InstructTA, which uses a target text $y_t$ to generate a target image $x_t$ via a text-to-image model and infers an instruction $\boldsymbol{p}^\prime$ with GPT-4, then optimizes perturbations against a surrogate model so that the adversarial image aligns with the target image in an instruction-aware feature space, with the instruction augmented by paraphrases. A dual objective incorporating MF-it improves transferability across diverse LVLM backends; PGD with a dynamic instruction set yields strong targeted attack performance. Experiments across five LVLMs show that InstructTA outperforms baselines in CLIP score and attack success rate, including under cross-instruction transfer scenarios. The work highlights security concerns for LVLMs and suggests mitigation strategies such as adversarial training and detection based on instruction-aware feature discrepancies.
Abstract
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in image understanding and response generation. However, this rich visual interaction also makes LVLMs vulnerable to adversarial examples. In this paper, we formulate a novel and practical targeted attack scenario in which the adversary knows only the vision encoder of the victim LVLM, with no knowledge of its prompts (which are often proprietary to service providers and not publicly available) or its underlying large language model (LLM). This practical setting poses challenges to the cross-prompt and cross-model transferability of targeted adversarial attacks, which aim to mislead the LVLM into outputting a response that is semantically similar to the attacker's chosen target text. To this end, we propose an instruction-tuned targeted attack (dubbed InstructTA) to deliver targeted adversarial attacks on LVLMs with high transferability. First, we use a public text-to-image generative model to "reverse" the target response into a target image, and employ GPT-4 to infer a reasonable instruction $\boldsymbol{p}^\prime$ from the target response. We then build a local surrogate model (sharing the same vision encoder as the victim LVLM) to extract instruction-aware features of an adversarial image example and the target image, and optimize the adversarial example by minimizing the distance between these two features. To further improve transferability through instruction tuning, we augment the instruction $\boldsymbol{p}^\prime$ with paraphrased instructions generated by GPT-4. Extensive experiments demonstrate the superiority of our proposed method in targeted attack performance and transferability. The code is available at https://github.com/xunguangwang/InstructTA.
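The feature-matching optimization described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy setting in which each instruction paraphrase is represented by a hypothetical linear encoder (a weight matrix `W`), images are flat lists of pixel values in [0, 1], the distance is squared L2, and the update is a standard PGD step under an L-infinity budget. The names `matvec`, `feature_loss_and_grad`, and `pgd_feature_match`, as well as the default `eps`, `alpha`, and `steps`, are illustrative assumptions.

```python
import random


def matvec(W, x):
    """Apply a toy linear 'encoder' W (list of rows) to a flat image x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def feature_loss_and_grad(W, x_adv, feat_target):
    """Squared L2 feature distance and its gradient with respect to x_adv.

    loss = ||W x_adv - feat_target||^2, grad = 2 W^T (W x_adv - feat_target).
    """
    diff = [a - b for a, b in zip(matvec(W, x_adv), feat_target)]
    loss = sum(d * d for d in diff)
    grad = [
        2.0 * sum(W[i][j] * diff[i] for i in range(len(W)))
        for j in range(len(x_adv))
    ]
    return loss, grad


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def pgd_feature_match(x_clean, x_target, encoders,
                      eps=8 / 255, alpha=1 / 255, steps=100, seed=0):
    """Pull x_adv toward x_target in feature space while staying near x_clean.

    `encoders` is a hypothetical stand-in for the surrogate model evaluated
    under different instruction paraphrases; one is sampled at every step.
    """
    rng = random.Random(seed)
    x_adv = list(x_clean)
    for _ in range(steps):
        W = rng.choice(encoders)            # stand-in for sampling an instruction
        feat_target = matvec(W, x_target)   # features of the target image
        _, grad = feature_loss_and_grad(W, x_adv, feat_target)
        for j in range(len(x_adv)):
            # Signed gradient descent step on the feature distance.
            stepped = x_adv[j] - alpha * sign(grad[j])
            # Project onto the L-infinity ball of radius eps around x_clean.
            stepped = min(max(stepped, x_clean[j] - eps), x_clean[j] + eps)
            # Keep the pixel inside the valid range [0, 1].
            x_adv[j] = min(max(stepped, 0.0), 1.0)
    return x_adv
```

In a realistic setting, the linear maps would be replaced by the surrogate LVLM's instruction-aware feature extractor and the gradient would come from automatic differentiation; the projection and clipping steps would stay the same.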
