VALD: Multi-Stage Vision Attack Detection for Efficient LVLM Defense

Nadav Kadvil; Ayellet Tal

TL;DR

This work introduces a general, efficient, and training-free defense for LVLMs that combines image transformations with agentic data consolidation to recover correct model behavior. It achieves state-of-the-art accuracy while letting most clean images skip costly processing.

Abstract

Large Vision-Language Models (LVLMs) can be vulnerable to adversarial images that subtly bias their outputs toward plausible yet incorrect responses. We introduce a general, efficient, and training-free defense that combines image transformations with agentic data consolidation to recover correct model behavior. A key component of our approach is a two-stage detection mechanism that quickly filters out the majority of clean inputs. We first assess image consistency under content-preserving transformations at negligible computational cost. For more challenging cases, we examine discrepancies in a text-embedding space. Only when necessary do we invoke a powerful LLM to resolve attack-induced divergences. A key idea is to consolidate multiple responses, leveraging both their similarities and their differences. We show that our method achieves state-of-the-art accuracy while maintaining notable efficiency: most clean images skip costly processing, and even in the presence of numerous adversarial examples, the overhead remains minimal.
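The staged control flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `defend`, the cosine-similarity checks, both threshold values, and every injected callable (`image_features`, `lvlm`, `embed_text`, `consolidate`) are hypothetical placeholders. The source does not specify how image consistency or text-embedding discrepancy is measured.

```python
import math
from typing import Any, Callable, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def defend(
    image: Any,
    instruction: str,
    transforms: List[Callable[[Any], Any]],
    image_features: Callable[[Any], Sequence[float]],  # hypothetical cheap image descriptor
    lvlm: Callable[[Any, str], str],                   # hypothetical LVLM query function
    embed_text: Callable[[str], Sequence[float]],      # hypothetical text embedder
    consolidate: Callable[[List[str]], str],           # hypothetical LLM-based consolidation
    early_threshold: float = 0.9,                      # assumed value, not from the paper
    late_threshold: float = 0.8,                       # assumed value, not from the paper
) -> str:
    views = [t(image) for t in transforms]

    # Stage 1 (early detection): cheap consistency check in image space.
    base = image_features(image)
    suspected = any(
        cosine(base, image_features(v)) < early_threshold for v in views
    )
    if not suspected:
        # Clean path: a single LVLM query, no further processing.
        return lvlm(image, instruction)

    # Suspected path: query the LVLM on the image and each transformed view.
    responses = [lvlm(x, instruction) for x in [image, *views]]

    # Stage 2 (late detection): compare responses in a text-embedding space.
    embeddings = [embed_text(r) for r in responses]
    attacked = any(
        cosine(embeddings[0], e) < late_threshold for e in embeddings[1:]
    )
    if not attacked:
        return responses[0]

    # Only now invoke the expensive consolidation step.
    return consolidate(responses)
```

The point of the ordering is cost: the clean path issues one LVLM call, the suspected-but-clean path issues one call per view, and the consolidation LLM runs only when both stages flag the input.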

Paper Structure (13 sections, 4 equations, 6 figures, 4 tables)

Introduction
Related Work
Method
Evaluation
Experimental setup
Results
Ablation study
Conclusion
Limitations
Prompts
Example for reasoning process
Additional qualitative results
Visualization of the limitation

Figures (6)

Figure 1: LVLM Defense. Given an adversarial image designed to subtly steer the model toward benign but incorrect outputs, our goal is to produce the correct response (e.g., caption) for the input.
Figure 2: Architecture. The input image $x$, which may be clean (green) or adversarial (red), undergoes various transformations. The early detection stage takes the input image and its transformations and determines whether $x$ is suspected of being adversarial. If so, the LVLM is queried independently on the image and on each of its transformations (solid red arrow) using the original instruction, producing a response set $R$ that is then processed in the late detection phase. Otherwise, the defense process is bypassed, and the LVLM is queried only with the original image (solid green arrow). The late detection phase receives $R$ as input and determines whether the image has been attacked by analyzing the textual embedding space. If no attack is detected, only the response generated from the original input image is used as the final output. If an attack is detected, an LLM-based consolidation process is applied to $R$ to generate the final correct response as part of the defense.
Figure 3: Qualitative results. Although the attacked images (b) appear nearly identical to the originals (a), the LVLM incorrectly captions them (c). In contrast, our model successfully defends against the attack and generates accurate captions (d). The images were attacked via zhao2023evaluating. See the appendix section "Additional qualitative results" for more examples.
Figure 4: Consolidation process explanation. The following elements are mentioned by most captions and form the basis of the final consolidated description: a snow-covered park or forest, a wooden bench, trees, and the bench being covered in snow. Conversely, the following details either conflict with the majority or appear only sporadically; therefore, they are treated as inconsistent and omitted from the final caption: a coconut beauty product, a bird on the bench, a green object on the tree, and a black-and-white filter.
Figure 5: Qualitative results. Although the attacked images (b) appear nearly identical to the originals (a), the LVLM incorrectly captions them (c). In contrast, our model successfully defends against the attack and generates accurate captions (d). The images were attacked via zhao2023evaluating.
...and 1 more figure
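The consolidation rule described in the Figure 4 caption (keep what most captions mention, drop details that conflict or appear only sporadically) can be illustrated with a toy majority vote. This is only a deterministic stand-in under stated assumptions: in the paper, consolidation is performed by an LLM over the raw responses, whereas this sketch assumes each caption has already been reduced to a set of elements. The function name `split_by_majority` and the per-caption element sets are hypothetical, loosely built from the elements named in Figure 4.

```python
from collections import Counter
from typing import List, Set, Tuple


def split_by_majority(element_sets: List[Set[str]]) -> Tuple[List[str], List[str]]:
    """Split elements into those mentioned by a strict majority of captions
    (kept) and all others (treated as inconsistent and dropped)."""
    counts = Counter(e for elements in element_sets for e in elements)
    half = len(element_sets) / 2
    kept = sorted(e for e, c in counts.items() if c > half)
    dropped = sorted(e for e, c in counts.items() if c <= half)
    return kept, dropped


# Hypothetical element sets, one per LVLM response (illustrative data only).
captions = [
    {"snow-covered park", "wooden bench", "trees"},
    {"snow-covered park", "wooden bench", "bird on the bench"},
    {"snow-covered park", "wooden bench", "trees", "coconut beauty product"},
]

kept, dropped = split_by_majority(captions)
print("kept:", kept)
print("dropped:", dropped)
```

A strict-majority threshold is one simple choice; an LLM-based consolidator can additionally weigh paraphrases and contradictions that exact set membership cannot capture.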
