Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein

TL;DR

This work tackles the vulnerability of large vision-language models to visual adversarial attacks by introducing FARE, an unsupervised adversarial fine-tuning scheme for the CLIP image encoder that preserves original embeddings. By optimizing a feature-based loss with PGD, FARE yields robust vision embeddings that can replace the original CLIP without retraining downstream LVLM components, improving robustness across LVLMs like OpenFlamingo and LLaVA while maintaining strong zero-shot performance. Empirically, FARE (especially at ε=4/255) outperforms the supervised TeCoA baselines in both clean and robust metrics and reduces stealthy targeted attacks and jailbreaking vulnerabilities, with added improvements in hallucination and reasoning benchmarks. The approach offers a practical, generalizable defense for deploying robust LVLMs in real-world settings without costly re-training of downstream systems.
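The inner PGD step of such a feature-based objective can be sketched on a toy problem. The sketch below assumes the loss is the squared $\ell_2$ distance between the fine-tuned embedding of a perturbed input and the original embedding of the clean input, maximized over an $\ell_\infty$ ball of radius $\varepsilon$. The small linear maps `W_org` and `W_ft` are hypothetical stand-ins for the frozen and fine-tuned CLIP image encoders, and all function names are illustrative rather than taken from the authors' code.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sign(g):
    return (g > 0) - (g < 0)

def feature_loss(W_ft, W_org, x, delta):
    """Squared l2 distance between the fine-tuned embedding of x + delta
    and the original embedding of the clean x."""
    z = [xi + di for xi, di in zip(x, delta)]
    return sum((a - b) ** 2 for a, b in zip(matvec(W_ft, z), matvec(W_org, x)))

def pgd_feature_attack(W_ft, W_org, x, eps, step, n_steps):
    """Projected sign-gradient ascent on feature_loss over ||delta||_inf <= eps."""
    target = matvec(W_org, x)  # clean embedding of the frozen original encoder
    delta = [0.0] * len(x)
    for _ in range(n_steps):
        z = [xi + di for xi, di in zip(x, delta)]
        resid = [a - b for a, b in zip(matvec(W_ft, z), target)]
        # Analytic gradient for a linear encoder: 2 * W_ft^T * resid
        grad = [2.0 * sum(W_ft[i][j] * resid[i] for i in range(len(W_ft)))
                for j in range(len(x))]
        new_delta = []
        for xj, dj, gj in zip(x, delta, grad):
            dj = max(-eps, min(eps, dj + step * sign(gj)))  # project onto the l_inf ball
            dj = min(1.0, max(0.0, xj + dj)) - xj           # keep pixel values in [0, 1]
            new_delta.append(dj)
        delta = new_delta
    return delta

# Toy setup (assumed values): a slightly drifted copy of the original encoder.
W_org = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]
W_ft = [[0.55, -0.2, 0.1], [0.3, 0.75, -0.4]]
x = [0.2, 0.5, 0.7]
eps = 4 / 255

delta = pgd_feature_attack(W_ft, W_org, x, eps, step=eps / 4, n_steps=10)
clean_loss = feature_loss(W_ft, W_org, x, [0.0] * len(x))
adv_loss = feature_loss(W_ft, W_org, x, delta)
print(clean_loss, adv_loss)
```

In an actual fine-tuning loop, the outer step would then update the fine-tuned encoder's weights to reduce the loss at the perturbed input, which is what pulls adversarial embeddings back toward the original clean ones.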

Abstract

Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. These attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many large vision-language models (LVLMs), e.g. LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which yields robustness on all vision down-stream tasks (LVLMs, zero-shot classification) that rely on CLIP. In particular, we show that stealth-attacks on users of LVLMs by a malicious third party providing manipulated images are no longer possible once one replaces the original CLIP model with our robust one. No retraining or fine-tuning of the down-stream LVLMs is required. The code and robust models are available at https://github.com/chs20/RobustVLM

Paper Structure (33 sections, 2 theorems, 15 equations, 6 figures, 11 tables)

This paper contains 33 sections, 2 theorems, 15 equations, 6 figures, 11 tables.

Key Result

Theorem 3.1

Let $\phi_{\mathrm{Org}}, \phi_{\mathrm{FT}}$ be the original and fine-tuned image embeddings and $\psi$ the text embedding of CLIP. Then

Figures (6)

  • Figure 1: (Robust) performance of LLaVA-1.5 on vision-language tasks and zero-shot (robust) classification for different CLIP models as vision encoder: (i) the original CLIP, (ii) TeCoA2: robust CLIP with supervised adversarial fine-tuning (Mao et al., 2022) at $\ell_\infty$ radius of $2/255$, and (iii) FARE2: robust CLIP using our proposed unsupervised adversarial fine-tuning at $\ell_\infty$ radius of $2/255$. The original CLIP is completely non-robust. Our FARE2 model has better clean and robust performance than TeCoA2 on almost all down-stream tasks; see Fig. 2 for qualitative outputs.
  • Figure 2: Illustration of targeted $\ell_\infty$-attacks with $\varepsilon=4/255$ on LLaVA when using different CLIP models as vision encoder: Original CLIP is highly susceptible to targeted imperceptible adversarial attacks. Using the supervised adversarially fine-tuned TeCoA4-CLIP encoder (trained at $4/255$), LLaVA becomes robust against the attack but the output is of lower quality even on the original image. With our unsupervised adversarially fine-tuned FARE4-CLIP encoder (trained at $4/255$), LLaVA becomes robust against the attack and the output is of high quality. See Fig. 3 for more examples.
  • Figure 3: Stealthy targeted $\ell_\infty$-attacks at $\varepsilon=4/255$. We show outcomes (good outputs, outputs with mistakes, and successful attacks) of the targeted attacks from the targeted-attack results table. LLaVA with CLIP performs well on benign images (left), but outputs the attacker's target string on adversarially perturbed images irrespective of the original image content (right). LLaVA with TeCoA4-CLIP is not susceptible to the attack, but the generated captions are of worse quality even on benign images. LLaVA with our FARE4-CLIP is equally robust against the attack but has high performance on benign input, and its captions under attack are quite similar to those for the benign input.
  • Figure 4: Visual examples from the POPE hallucination benchmark. The model is queried with a question and prompted to answer "Yes" or "No". GT-Answer is the ground-truth response to the question; a red background indicates hallucination, whereas a green background indicates correct output.
  • Figure 5: Qualitative results for stealthy targeted attacks ($\varepsilon_\infty=4/255$) on image captioning using LLaVA for different employed CLIP models: for each of the 6 target captions we show two randomly chosen images from the 25 respective attacked images (one per sequence is shown in Fig. 3). The overall success rate for the original CLIP model is 100% (see the targeted-attack results table), whereas all robust CLIP models are not susceptible to the attack.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Theorem 3.1
  • Theorem 1.1
  • proof