Adversarial Robustness of Vision in Open Foundation Models

Jonathon Fox, William J Buchanan, Pavlos Papadopoulos

TL;DR

This study evaluates the adversarial robustness of two open-weight Vision-Language Models (VLMs), LLaVA-1.5-13B and Llama 3.2 Vision-8B-2, under untargeted PGD attacks on a subset of the VQA v2 dataset. Llama 3.2 Vision shows smaller accuracy drops at higher perturbation levels despite a lower baseline accuracy, indicating that robustness is not strictly tied to standard accuracy. The work highlights the vision modality as a viable attack surface in multimodal systems and discusses how architecture and training may influence robustness. These findings motivate broader safety evaluations and defense strategies for open-weight vision-language foundation models.

Abstract

As deep learning becomes more widespread, it becomes increasingly difficult to understand how AI systems identify objects. An adversary can exploit this by modifying an image with elements that a human observer does not notice but that confuse the AI in its recognition of an entity. This paper investigates the adversarial robustness of LLaVA-1.5-13B and Meta's Llama 3.2 Vision-8B-2. Both models are subjected to untargeted PGD (Projected Gradient Descent) attacks against the visual input modality and evaluated empirically on a subset of the Visual Question Answering (VQA) v2 dataset. The effect of the attacks is quantified using the standard VQA accuracy metric, and the resulting accuracy degradation (accuracy drop) is compared between LLaVA and Llama 3.2 Vision. A key finding is that Llama 3.2 Vision, despite a lower baseline accuracy in this setup, exhibited a smaller drop in performance under attack than LLaVA, particularly at higher perturbation levels. Overall, the findings confirm that the vision modality is a viable attack vector for degrading the performance of contemporary open-weight VLMs, including Meta's Llama 3.2 Vision. They also show that adversarial robustness does not necessarily correlate with standard benchmark performance and may be influenced by underlying architectural and training factors.
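To make the attack named in the abstract concrete, the following is a minimal sketch of untargeted PGD under an $L_\infty$ budget. It is not the authors' implementation. It uses the commonly stated update $x^{t+1} = \Pi_{\epsilon}\left(x^{t} + \alpha \cdot \text{sign}(\nabla_x J(\theta, x^{t}, y))\right)$, where $\Pi_{\epsilon}$ projects back into the $\epsilon$-ball around the clean input. The toy logistic model, the function names, the step size `alpha`, and the step count are all assumptions made for illustration; a VLM would supply the gradient through automatic differentiation instead of the closed form used here.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_grad(x, y, w, b):
    """Binary cross-entropy of a toy logistic model and its gradient w.r.t. x.

    Hypothetical stand-in for a VLM loss; chosen because the input gradient
    has a closed form, so no autodiff library is needed.
    """
    p = sigmoid(np.dot(w, x) + b)
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad_x = (p - y) * w
    return loss, grad_x


def pgd_linf(x, y, w, b, eps, alpha, steps, rng=None):
    """Untargeted L-infinity PGD: repeatedly ascend the loss, then project."""
    x_adv = x.copy()
    if rng is not None:
        # Optional random start inside the eps-ball (an assumption, not
        # something this section states about the paper's setup).
        x_adv = np.clip(x_adv + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(steps):
        _, grad = loss_and_grad(x_adv, y, w, b)
        x_adv = x_adv + alpha * np.sign(grad)      # ascent step (untargeted)
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep a valid pixel range
    return x_adv
```

With `steps=1` and `alpha=eps` (and no random start), the loop reduces to a single signed-gradient step, which is the FGSM perturbation.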

Paper Structure

This paper contains 52 sections, 7 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The Transformer architecture, showing the encoder (left) and decoder (right) stacks with multi-head attention and feed-forward layers. Reproduced from Vaswani et al. (2017).
  • Figure 2: Demonstration of adversarial perturbation using FGSM. An imperceptible perturbation $\eta = \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))$ is added to the original image (a), resulting in an adversarial image (c) that is misclassified with high confidence. The noise (b) is scaled for visibility. Reproduced from Goodfellow et al. (2015).
  • Figure 3: VQA Accuracy vs. Adversarial Perturbation Strength ($\epsilon$). Compares LLaVA-1.5-13B and Llama 3.2 Vision-8B-2 performance under PGD attack with varying $L_\infty$ budgets.
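
The quantity plotted in Figure 3 can be illustrated with a short sketch. It assumes that the "standard VQA accuracy metric" refers to the widely used rule in which an answer scores $\min(n/3, 1)$, where $n$ is the number of human annotators who gave that answer. The function names are hypothetical, and the answer normalization and annotator-subset averaging performed by official evaluation scripts are omitted for brevity.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one prediction: min(#matching human answers / 3, 1)."""
    pred = predicted.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


def mean_vqa_accuracy(predictions, answer_sets):
    """Average the per-question score over an evaluation subset."""
    scores = [vqa_accuracy(p, a) for p, a in zip(predictions, answer_sets)]
    return sum(scores) / len(scores)


def accuracy_drop(clean_accuracy, adversarial_accuracy):
    """Accuracy degradation at a given epsilon: clean minus attacked accuracy."""
    return clean_accuracy - adversarial_accuracy
```

Comparing models by `accuracy_drop` at each $\epsilon$, not only by absolute accuracy, is what allows a model with a lower baseline to be judged more robust. Any values passed to these functions in a quick check are placeholders, not results from the paper.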