Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques
Rishika Bhagwatkar, Shravan Nayak, Reza Bayat, Alexis Roger, Daniel Z Kaplan, Pouya Bashivan, Irina Rish
TL;DR
This study investigates adversarial robustness in vision-language models (VLMs), examining how design choices (vision encoder, LLM size, mapping, and input resolution) and prompt formatting affect resilience to white-box image perturbations. It systematically evaluates FGSM, PGD, and APGD (Auto-PGD) attacks across multiple tasks (captioning and VQA) and model configurations, revealing that vision-encoder characteristics and prompt strategies largely shape robustness, while language-model size offers limited benefit. Notably, ensemble approaches can be vulnerable when even a single encoder is attacked, but prompt-based defenses, such as rephrasing prompts or signaling potential perturbations, promise substantial robustness gains without costly adversarial training. The findings offer actionable guidance for deploying robust, safety-conscious VLMs in real-world settings and underscore prompt formatting as a practical mitigation tool.
Abstract
Vision-Language Models (VLMs) have witnessed a surge in both research and real-world applications. As they become increasingly prevalent, ensuring their robustness against adversarial attacks is paramount. This work systematically investigates the impact of model design choices on the adversarial robustness of VLMs against image-based attacks. Additionally, we introduce novel, cost-effective approaches to enhance robustness through prompt formatting. By rephrasing questions and indicating in the prompt that the image may contain adversarial perturbations, we demonstrate substantial improvements in model robustness against strong image-based attacks such as Auto-PGD. Our findings provide important guidelines for developing more robust VLMs, particularly for deployment in safety-critical environments.
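The prompt formatting idea described above can be illustrated with a minimal Python sketch. It builds a few variants of a VQA question: the original, a rephrased version, and one that signals a possible perturbation. The template wording, the function names, and the `PROMPT_VARIANTS` table are hypothetical choices made for illustration; they are not the prompts used in the paper, and the sketch does not call any model.

```python
from typing import Callable, Dict


def original(question: str) -> str:
    """Return the question unchanged apart from trimming whitespace."""
    return question.strip()


def rephrased(question: str) -> str:
    """Wrap the question in a hypothetical rephrasing template (an assumption)."""
    return f"Look at the image carefully and answer: {question.strip()}"


def perturbation_notice(question: str) -> str:
    """Prepend a hypothetical notice that the image may be adversarially perturbed."""
    notice = "The image may have been adversarially perturbed."
    return f"{notice} {question.strip()}"


# Hypothetical registry of prompt variants to compare under attack.
PROMPT_VARIANTS: Dict[str, Callable[[str], str]] = {
    "original": original,
    "rephrased": rephrased,
    "perturbation_notice": perturbation_notice,
}


def build_prompts(question: str) -> Dict[str, str]:
    """Produce every prompt variant for a single question."""
    return {name: fn(question) for name, fn in PROMPT_VARIANTS.items()}


if __name__ == "__main__":
    for name, prompt in build_prompts("What color is the car?").items():
        print(f"{name}: {prompt}")
```

In an evaluation loop, each variant would be paired with the same clean and attacked images so that any difference in accuracy can be attributed to the prompt wording alone.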
