
A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models

Mujtaba Hussain Mirza, Antonio D'Orazio, Odelia Melamed, Iacopo Masi

Abstract

Despite the rapid progress in multimodal models and Large Vision-Language Models (LVLMs), they remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference. In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances robustness by minimizing the energy of the input samples. Our method is grounded in a theory proving that our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers and for zero-shot classification with CLIP, and also boosts the robustness of LVLMs on tasks such as Image Captioning and Visual Question Answering. Code is available at github.com/OmnAI-Lab/Energy-Guided-Test-Time-Defense.
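
To make the mechanism concrete, here is a minimal PyTorch sketch of an energy-guided test-time transform in the spirit of ET3. The energy is taken as $E(\mathbf{x}) = -\log \sum_y \exp f_\theta(\mathbf{x})_y$, a standard energy-based reading of classifier logits that we assume here; `et3_transform`, `logits_fn`, and the default step size are hypothetical names and values, and the paper's exact energy and schedule may differ.

```python
import torch

def et3_transform(x, logits_fn, alpha=1.0, T=1):
    """Hypothetical sketch of an energy-guided test-time transform.

    `logits_fn` maps an image batch to class logits. The energy is read
    as E(x) = -logsumexp(f(x)), a common energy-based view of classifier
    logits (assumed here, not necessarily the paper's exact definition).
    The transform adds a small perturbation z that lowers this energy.
    """
    x_t = x.clone().detach()
    for _ in range(T):
        x_t.requires_grad_(True)
        # Input energy summed over the batch
        energy = -torch.logsumexp(logits_fn(x_t), dim=-1).sum()
        (grad,) = torch.autograd.grad(energy, x_t)
        # Gradient descent step on the energy: z = -alpha * grad(E)
        x_t = (x_t - alpha * grad).clamp(0.0, 1.0).detach()
    return x_t
```

With $T = 1$, this reduces to a single update $\tilde{\mathbf{x}} = \mathbf{x} - \alpha \nabla_{\mathbf{x}} E(\mathbf{x})$, the setting analyzed in Theorem 4.1 below.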


Paper Structure

This paper contains 31 sections, 1 theorem, 34 equations, 7 figures, 15 tables.

Key Result

Theorem 4.1

Let $\mathbf{x} \in {\mathbb R}^d$ be a data sample and $y_t$ its ground-truth label. Let $f_\theta : {\mathbb R}^d \rightarrow {\mathbb R}^2$ be a binary classifier that is locally linear in $\mathcal{B}_\epsilon(\mathbf{x})$. Denote the logit margin $r_x = f_\theta(\mathbf{x})_{y_t} - f_\theta(\mathbf{x})_{1-y_t}$. Then, for the $\textsc{ET3}$ defense transformation $\mathbf{z}$, parametrized by $T=1$ and $\alpha$, …
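
The full statement is truncated above; a plausible form of the $T=1$ update it analyzes can be reconstructed from the surrounding definitions. The display below assumes the standard energy $E(\mathbf{x}) = -\log\sum_y \exp f_\theta(\mathbf{x})_y$ and uses the $e_y$, $\mathbf{g}_y$ notation from Figure 3; it is a reading aid, not the paper's verbatim statement.

```latex
% Hedged sketch of the T=1 ET3 update (assumed energy definition, not
% necessarily the paper's): z descends the input energy, whose gradient
% decomposes over the class logits as in Fig. 3.
\tilde{\mathbf{x}} = \mathbf{x} + \mathbf{z}, \qquad
\mathbf{z} = -\alpha \,\nabla_{\mathbf{x}} E(\mathbf{x})
           = \alpha \sum_{y \in \{0,1\}} e_y \, \mathbf{g}_y,
\quad \text{with } e_y = \operatorname{softmax}\!\big(f_\theta(\mathbf{x})\big)_y,\;
\mathbf{g}_y = \nabla_{\mathbf{x}} f_\theta(\mathbf{x})_y .
```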

Figures (7)

  • Figure 1: (top) A natural image $\mathbf{x}$ of a green mamba and its adversarial counterpart $\mathbf{x}^{\star}$, mistakenly classified as a zucchini by a robust classifier $f_\theta$. Given only $\mathbf{x}^{\star}$ and $f_\theta$, our ET3 test-time defense produces the correctly classified $\tilde{\mathbf{x}}$, thereby boosting adversarial robustness. The plot illustrates the change in logits: even though the ground-truth class is not the second best, ET3 still recovers it. (bottom) ET3 boosts the robust accuracy of Large VLMs like LLaVA [liu2023visual] using standard or even Robust CLIP [schlarmann2024robust] on both image captioning and Visual QA. Note that in the VQA example, the Reds are a different team from the Red Sox, referring to the Cincinnati Reds.
  • Figure 2: ① ET3 transforms the natural image $\mathbf{x}$ by adding a small perturbation $\mathbf{z}$ optimized to lower the energy with respect to ImageNet-$21k$ proxy classes and concepts. This allows robust zero-shot classification; ② the transformed image transfers to and protects Large VLMs, thereby increasing their robustness. The VLM is not used in the optimization; the optimized image simply transfers to the VLM through the internal representation of its visual encoder.
  • Figure 3: (left) The ET3 defense transformation for adversarial examples. We assume local linearity of the model in the defense neighborhood $\mathcal{B}_\epsilon(\mathbf{x})$ and a large enough ratio $C$ between the norms of the energy gradients through each class logit, $e_0 \mathbf{g}_0$ and $e_1 \mathbf{g}_1$. The adversarial attack, determined to reduce the ground-truth logit, follows the negative direction of the larger gradient ($-\mathbf{g}_1$), while our transformation follows its positive direction ($\mathbf{g}_1$), increasing the ground-truth logit and pulling the adversarial point back to its ground-truth region. Both may also increase the other logit, corresponding to the smaller gradient $\mathbf{g}_0$, which can introduce a smaller deviation. (right) Scatter plot of the gradient-norm ratio $C$ against the logit margin at the transformed image $\tilde{\mathbf{x}}$, for a robust ImageNet classifier on 1000 randomly sampled ImageNet images (a code sketch of this diagnostic appears after this list). For most samples with $C>1$, the purified image is correctly classified (logit margin $> 0$), showing a correlation between the norm ratio and the logit margin of the transformed image.
  • Figure 4: Robust accuracy across increasing attack strengths in the defense-unaware setting. Average zero-shot accuracy of CLIP over 14 benchmark datasets, showing that ET3 consistently improves the robustness of the TeCoA models trained with different defense strengths ($\epsilon_t$) as the attack strength ($\epsilon_a$) increases.
  • Figure 5: Qualitative comparison of generated captions for a sample image. ET3 corrects captions affected by adversarial attacks on standard CLIP and further refines captions produced by robust TeCoA and FARE. Green rows indicate semantically correct captions, red rows denote incorrect captions, and yellow rows highlight outputs with partial errors that still broadly reflect the image content. All attacks are generated with $\epsilon_a = 4/255$.
  • ...and 2 more figures
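
As referenced in the Figure 3 caption, the following is a hedged sketch of how the gradient-norm ratio $C$ and the logit margin in that scatter plot could be computed. `c_ratio_and_margin`, `logits_fn`, the softmax weights $e_y$, and the assumed energy $E = -\operatorname{logsumexp}(f)$ are our reconstruction, not necessarily the paper's implementation.

```python
import torch

def c_ratio_and_margin(x, logits_fn, y_t):
    """Hypothetical diagnostic behind the Fig. 3 scatter: the ratio
    C = ||e_{y_t} g_{y_t}|| / ||e_{1-y_t} g_{1-y_t}|| between the
    energy-weighted logit gradients, and the logit margin. Here g_y is
    the input gradient of logit y and e_y its softmax weight, one
    reading consistent with E(x) = -logsumexp(f(x)); indices follow the
    binary case of Theorem 4.1.
    """
    x = x.clone().requires_grad_(True)
    logits = logits_fn(x).squeeze()           # assumes two logits, shape (2,)
    e = logits.softmax(-1).detach()           # softmax weights e_0, e_1
    g = [torch.autograd.grad(logits[y], x, retain_graph=True)[0]
         for y in range(2)]
    c = (e[y_t] * g[y_t]).norm() / (e[1 - y_t] * g[1 - y_t]).norm()
    margin = logits[y_t] - logits[1 - y_t]    # > 0 means correct prediction
    return c.item(), margin.item()
```

In the figure's reading, $C > 1$ at the transformed image $\tilde{\mathbf{x}}$ tends to coincide with a positive logit margin, i.e., a correct classification.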

Theorems & Definitions (2)

  • Theorem 4.1
  • Remark 4.1: The Local Linearity and Gradient Norm Ratio Assumptions