PIP: Detecting Adversarial Examples in Large Vision-Language Models via Attention Patterns of Irrelevant Probe Questions

Yudong Zhang; Ruobing Xie; Jiansheng Chen; Xingwu Sun; Yu Wang

TL;DR

The paper addresses adversarial threats to LVLMs by proposing PIP, a simple detector that uses the attention pattern of an irrelevant probe question to distinguish clean from adversarial inputs. A lightweight classifier (an SVM or a shallow decision tree) is trained on the attention maps produced in response to a yes/no probe. With it, PIP achieves recall above 98% and precision above 90% across multiple LVLMs, datasets, and attack scenarios, including black-box settings. The approach requires only one additional inference per test image and is shown to generalize across datasets and attack methods, though using multiple probes helps reduce false alarms. Overall, PIP offers a practical, interpretable, and effective mechanism for detecting adversarial examples in LVLMs and can inform post-processing defenses and system safety.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated their powerful multimodal capabilities. However, they also face serious safety problems, as adversaries can induce robustness issues in LVLMs through the use of well-designed adversarial examples. Therefore, LVLMs are in urgent need of detection tools for adversarial examples to prevent incorrect responses. In this work, we first discover that LVLMs exhibit regular attention patterns for clean images when presented with probe questions. We propose an unconventional method named PIP, which utilizes the attention patterns of one randomly selected irrelevant probe question (e.g., "Is there a clock?") to distinguish adversarial examples from clean examples. Regardless of the image to be tested and its corresponding question, PIP only needs to perform one additional inference of the image to be tested and the probe question, and then achieves successful detection of adversarial examples. Even under black-box attacks and open dataset scenarios, our PIP, coupled with a simple SVM, still achieves more than 98% recall and a precision of over 90%. Our PIP is the first attempt to detect adversarial attacks on LVLMs via simple irrelevant probe questions, shedding light on deeper understanding and introspection within LVLMs. The code is available at https://github.com/btzyd/pip.
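
The detection flow described in the abstract can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation (the linked repository is the reference for that). It assumes attention maps are already available as fixed-length vectors, substitutes synthetic vectors in which two hypothetical token positions shift under attack, and trains a linear SVM in plain Python by subgradient descent on the hinge loss. All function names, the vector length of 32, and the token indices are assumptions made for illustration.

```python
import random

SENSITIVE = (26, 27)  # hypothetical token indices, for illustration only


def synthetic_attention(rng, adversarial, dim=32):
    """Synthetic stand-in for an attention map (NOT real LVLM output)."""
    vec = [rng.gauss(0.10, 0.02) for _ in range(dim)]
    if adversarial:
        # Assumption: the attack shifts attention on a few sensitive tokens.
        for idx in SENSITIVE:
            vec[idx] += 0.30
    return vec


def make_split(rng, n_per_class):
    """Build a balanced set; label +1 = adversarial, -1 = clean."""
    feats, labels = [], []
    for _ in range(n_per_class):
        feats.append(synthetic_attention(rng, False))
        labels.append(-1)
        feats.append(synthetic_attention(rng, True))
        labels.append(1)
    return feats, labels


def score(w, b, x):
    """Signed distance-like score of a linear classifier."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(feats, labels, epochs=30, lr=0.1, reg=1e-3, seed=0):
    """Subgradient descent on the L2-regularized hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0] * len(feats[0]), 0.0
    order = list(range(len(feats)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = feats[i], labels[i]
            margin = y * score(w, b, x)
            w = [wj * (1.0 - lr * reg) for wj in w]  # regularization shrink
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def precision_recall(y_true, y_pred):
    """Precision and recall with +1 (adversarial) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def run_demo(seed=0):
    """Train on one synthetic split and evaluate on a fresh one."""
    rng = random.Random(seed)
    train_x, train_y = make_split(rng, 100)
    test_x, test_y = make_split(rng, 100)
    w, b = train_linear_svm(train_x, train_y)
    pred = [1 if score(w, b, x) > 0 else -1 for x in test_x]
    return precision_recall(test_y, pred)
```

With real data, `synthetic_attention` would be replaced by the attention map extracted from the LVLM for the probe question, and an off-the-shelf SVM implementation would be the more natural choice. The numbers this sketch produces on synthetic data say nothing about the paper's reported results.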

Paper Structure (21 sections, 2 equations, 5 figures, 8 tables, 1 algorithm)

Introduction
Related Works
Large Vision-Language Models
Adversarial Attacks and Adversarial Examples
Detecting Adversarial Examples
A New Task: Detecting Adversarial Examples in Large Vision-Language Models
Definition of Our Adversarial Examples Detection Task
Evaluation of Our Detection Task
Explore the Use of Our PIP to Detecting Adversarial Examples
LVLMs Have Regularized Attention Patterns of Clean Examples to Yes/No Questions
The Attention Patterns of "yes/no" Probe Questions between Clean and Adversarial Examples are Clearly Distinguishable
Distinguishing Attention Maps via Lightweight Support Vector Machine
Exploring the PIP's Decision-making Process with Decision Trees
In-depth Analyses on PIP
Generalization of Our Adversarial Examples Detection Method across Datasets
...and 6 more sections

Figures (5)

Figure 1: Implications of our adversarial example detection method PIP. (Top): LVLMs can give correct answers for clean images. (Middle): LVLMs may give incorrect answers for adversarial images. (Bottom): When adversarial examples are detected by our simple PIP, LVLMs refuse to answer them, preventing security risks.
Figure 2: The pipeline of our proposed PIP. The top half is operated offline, while the bottom half is operated online. (Top): We perform adversarial attacks on $N$ images in $\mathcal{D}^{clean}_{ref}$ and obtain $N$ adversarial images, which constitute $\mathcal{D}^{adv}_{ref}$. We extract their attention maps (the attention of the first word generated by the LLM to all image tokens) from the LVLM with the irrelevant probe question "Is there a clock?", and train a lightweight linear classifier (e.g., SVM) on these $2N$ attention maps. (Bottom): For images to be tested from $\mathcal{D}_{test}$, we first obtain their attention maps with the same probe question, and then use the classifier to determine whether they are adversarial examples. Surprisingly, this simple method PIP functions well in this challenging task.
Figure 3: The attention maps of different types of questions on 1,000 randomly selected images and questions. Due to space limitations, we select only one layer (the 16th layer of the LLM) and display the maximum value in the multi-head attention. The attention map of (a) "yes/no" is more regular than those of (b) "number" and (c) "other", indicating that a simple "yes/no" question is a more suitable probe question.
Figure 4: The attention maps of $\mathcal{D}^\text{1k}_\text{clean}$, $\mathcal{D}^\text{1k}_\text{CLIP}$ and $\mathcal{D}^\text{1k}_\text{LLM}$. Due to space limitations, we select only one layer (the 16th layer of the LLM) and take the maximum value in the multi-head attention. The probe question is "Is there a clock?" in all cases. The attention maps of adversarial examples differ significantly from those of the clean examples on certain sensitive tokens (e.g., the 27th and 28th tokens), which serve as good indicators.
Figure 5: PIP with the decision-making process of decision trees. The DT (depth=2) linearly distinguishes between clean and adversarial examples using only two feature dimensions of the attention maps.
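
To make the decision-tree view in Figure 5 concrete, here is a small sketch of fitting a depth-2 tree to attention-map vectors. It is an illustration under stated assumptions, not the authors' code: the data are synthetic, each adversarial vector shifts one of two hypothetical token positions (zero-based indices 26 and 27, chosen only to echo the sensitive tokens mentioned in Figure 4), and the tree is a plain-Python exhaustive search over thresholds using Gini impurity. All function names are hypothetical.

```python
import random


def make_data(seed, n_per_class, dim=32):
    """Synthetic stand-ins for attention maps; 1 = adversarial, 0 = clean."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n_per_class):
        clean = [rng.gauss(0.10, 0.02) for _ in range(dim)]
        adv = [rng.gauss(0.10, 0.02) for _ in range(dim)]
        # Assumption: each attack shifts exactly one of two sensitive tokens.
        adv[26 if i % 2 == 0 else 27] += 0.30
        rows += [clean, adv]
        labels += [0, 1]
    return rows, labels


def gini(labels):
    """Gini impurity of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Return (impurity, feature, threshold) of the best axis-aligned split."""
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            s = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or s < best[0]:
                best = (s, f, t)
    return best


def fit_tree(rows, labels, depth):
    """Leaf = class label (int); node = (feature, threshold, left, right)."""
    majority = int(2 * sum(labels) >= len(labels))
    if depth == 0 or len(set(labels)) < 2:
        return majority
    split = best_split(rows, labels)
    if split is None:
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            fit_tree([r for r, _ in left], [y for _, y in left], depth - 1),
            fit_tree([r for r, _ in right], [y for _, y in right], depth - 1))


def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def used_features(tree):
    """Set of feature indices the tree actually splits on."""
    if not isinstance(tree, tuple):
        return set()
    return {tree[0]} | used_features(tree[2]) | used_features(tree[3])
```

A possible usage: fit with `tree = fit_tree(*make_data(0, 50), depth=2)`, then inspect `used_features(tree)` to see which attention positions drive the decision. On this synthetic data the tree should rely only on the two shifted positions, which mirrors the interpretability argument of Figure 5 without reproducing its actual thresholds.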
