Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques

Rishika Bhagwatkar, Shravan Nayak, Reza Bayat, Alexis Roger, Daniel Z Kaplan, Pouya Bashivan, Irina Rish

TL;DR

This study investigates adversarial robustness in vision-language models (VLMs), examining how design choices (vision encoder, LLM size, mapping, and input resolution) and prompt formatting affect resilience to white-box image perturbations. It systematically evaluates FGSM, PGD, and APGD attacks across multiple tasks (captioning and VQA) and model configurations, revealing that vision-encoder characteristics and prompt strategies largely shape robustness, while language-model size offers limited benefit. Notably, ensemble approaches can be vulnerable to a single attacked encoder, but prompt-based defenses—such as rephrasing prompts or signaling potential perturbations—promise substantial robustness gains without costly adversarial training. The findings offer actionable guidance for deploying robust, safety-conscious VLMs in real-world settings and underscore prompt formatting as a practical mitigation tool.

Abstract

Vision-Language Models (VLMs) have witnessed a surge in both research and real-world applications. However, as they are becoming increasingly prevalent, ensuring their robustness against adversarial attacks is paramount. This work systematically investigates the impact of model design choices on the adversarial robustness of VLMs against image-based attacks. Additionally, we introduce novel, cost-effective approaches to enhance robustness through prompt formatting. By rephrasing questions and suggesting potential adversarial perturbations, we demonstrate substantial improvements in model robustness against strong image-based attacks such as Auto-PGD. Our findings provide important guidelines for developing more robust VLMs, particularly for deployment in safety-critical environments.
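
For readers unfamiliar with the white-box image attacks named above, the following is a minimal sketch of FGSM and PGD under an L-infinity budget. It is not the paper's evaluation code: the toy logistic classifier, the helper names (`loss_and_input_grad`, `fgsm`, `pgd`), the budget `eps = 8/255`, and the step size are all assumptions chosen for illustration, and APGD (which adapts the step size) is not implemented here. In the paper's setting, the gradient would come from the VLM's loss with respect to the input image rather than from a closed-form expression.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_input_grad(x, y, w, b):
    """Binary cross-entropy of a toy logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(w @ x + b)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_x = (p - y) * w  # closed-form d(loss)/dx for this toy model
    return loss, grad_x

def fgsm(x, y, w, b, eps):
    """Single-step attack: move every pixel by eps in the direction that increases the loss."""
    _, g = loss_and_input_grad(x, y, w, b)
    return np.clip(x + eps * np.sign(g), 0.0, 1.0)

def pgd(x, y, w, b, eps, step, n_steps):
    """Iterative attack: repeated signed-gradient steps, each projected back onto the eps-ball."""
    x_adv = x.copy()
    for _ in range(n_steps):
        _, g = loss_and_input_grad(x_adv, y, w, b)
        x_adv = x_adv + step * np.sign(g)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection onto the L-infinity ball around x
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep pixel values in the valid range
    return x_adv

# Hypothetical toy data: a 16-"pixel" image and random classifier weights.
rng = np.random.default_rng(0)
w = rng.normal(size=16)
b = 0.0
x = rng.uniform(0.2, 0.8, size=16)
y = 1.0
eps = 8 / 255

clean_loss, _ = loss_and_input_grad(x, y, w, b)
x_fgsm = fgsm(x, y, w, b, eps)
x_pgd = pgd(x, y, w, b, eps, step=eps / 4, n_steps=10)
fgsm_loss, _ = loss_and_input_grad(x_fgsm, y, w, b)
pgd_loss, _ = loss_and_input_grad(x_pgd, y, w, b)
```

Both attacks raise the loss while keeping every pixel within `eps` of the original. On a linear toy model like this one the two attacks end up at essentially the same point; the iterative variants matter for nonlinear models such as VLMs, where the gradient direction changes along the path.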

Paper Structure

This paper contains 24 sections, 3 figures, and 17 tables.

Figures (3)

  • Figure 1: Performance of LLaVA-7B on the COCO dataset when the adversarial images are given along with different types of prompts (Original, AC, AP and Random). Clean accuracy represents the model's performance on unperturbed images.
  • Figure 2: Comparison between VLMs with different vision encoders (left), different input resolutions (center), and different LLM sizes (right). The comparison is based on the APGD accuracy averaged over all tasks, as shown in Tables \ref{tab:image-encoder-comparison}, \ref{tab:image-encoder-resolution}, \ref{tab:model-scale}, and \ref{tab:ensemble-ves}.
  • Figure 3: Performance of LLaVA-7B on VQA using questions generated by different types of prompts.