Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein
TL;DR
This work addresses the vulnerability of large vision-language models (LVLMs) to visual adversarial attacks by introducing FARE, an unsupervised adversarial fine-tuning scheme for the CLIP image encoder that keeps the fine-tuned encoder's embeddings close to those of the original CLIP. FARE minimizes a feature-based loss on adversarial examples generated with PGD, so the resulting robust vision encoder can replace the original CLIP without retraining downstream LVLM components. This improves robustness across LVLMs such as OpenFlamingo and LLaVA while maintaining strong zero-shot performance. Empirically, FARE (especially at ε=4/255) outperforms the supervised TeCoA baselines in both clean and robust metrics, reduces the success of stealthy targeted attacks and jailbreaking attempts, and also improves results on hallucination and reasoning benchmarks. The approach offers a practical, generalizable defense for deploying robust LVLMs in real-world settings without costly retraining of downstream systems.
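To make the idea of a feature-based loss optimized with PGD concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the loss is the squared L2 distance between the embedding of a perturbed image and the frozen original embedding of the clean image. The linear "encoder" `W`, the helper names, the step size, the number of iterations, and the learning rate are all hypothetical choices made only for illustration; a real setup would use the CLIP vision encoder and automatic differentiation.

```python
import numpy as np

def feature_loss(W, W_org, z, x):
    """Squared L2 distance between phi(z) = W z and the frozen phi_org(x) = W_org x."""
    diff = W @ z - W_org @ x
    return float(diff @ diff)

def pgd_feature_attack(W, W_org, x, eps, step, n_steps, rng):
    """Approximately maximize the feature loss over the l_inf ball of radius eps around x."""
    target = W_org @ x                               # frozen clean embedding
    z = x + rng.uniform(-eps, eps, size=x.shape)     # random start inside the ball
    z = np.clip(z, 0.0, 1.0)
    for _ in range(n_steps):
        grad = 2.0 * W.T @ (W @ z - target)          # analytic gradient w.r.t. z (toy linear encoder)
        z = z + step * np.sign(grad)                 # signed-gradient ascent step
        z = np.clip(z, x - eps, x + eps)             # project back onto the l_inf ball
        z = np.clip(z, 0.0, 1.0)                     # keep a valid pixel range
    return z

def fine_tune_step(W, W_org, x, z_adv, lr):
    """One gradient-descent step on the adversarial feature loss w.r.t. the encoder weights."""
    grad_W = 2.0 * np.outer(W @ z_adv - W_org @ x, z_adv)
    return W - lr * grad_W

# Toy setup (all values hypothetical): 16 "pixels" in [0, 1], 8-dimensional embedding.
rng = np.random.default_rng(0)
W_org = rng.normal(size=(8, 16))   # stands in for the original, frozen encoder
W = W_org.copy()                   # fine-tuning starts from the original weights
x = rng.uniform(0.0, 1.0, size=16)
eps = 4 / 255

z_adv = pgd_feature_attack(W, W_org, x, eps=eps, step=eps / 4, n_steps=10, rng=rng)
clean_loss = feature_loss(W, W_org, x, x)
adv_loss = feature_loss(W, W_org, z_adv, x)

W_new = fine_tune_step(W, W_org, x, z_adv, lr=0.01)
new_loss = feature_loss(W_new, W_org, z_adv, x)
```

In this sketch no labels or text prompts are used anywhere: the only supervision signal is the original encoder's own embedding of the clean input, which is why the scheme can be called unsupervised. Because the fine-tuned encoder is pulled toward the original embeddings, it is plausible that downstream components trained on those embeddings keep working after the swap, which is the property the summary above relies on.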
Abstract
Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. These attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many large vision-language models (LVLMs), e.g., LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which yields robustness on all downstream vision tasks (LVLMs, zero-shot classification) that rely on CLIP. In particular, we show that stealth attacks on users of LVLMs by a malicious third party providing manipulated images are no longer possible once one replaces the original CLIP model with our robust one. No retraining or fine-tuning of the downstream LVLMs is required. The code and robust models are available at https://github.com/chs20/RobustVLM
