Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness

Zeyu Wang, Cihang Xie, Brian Bartoldson, Bhavya Kailkhura

TL;DR

The paper tackles the robustness of vision-language models to adversarial visual perturbations by proposing Double Visual Defense, a two-stage training strategy that first adversarially pre-trains CLIP on web-scale data to produce $Δ$CLIP, and then applies adversarial visual instruction tuning to LLaVA to yield $Δ^2$LLaVA. This end-to-end defense surpasses prior post-hoc approaches, delivering substantial improvements in adversarial robustness across zero-shot recognition, image captioning, and visual question answering while preserving open-set generalization. The second layer of defense, adversarial autoregressive training, reduces hallucinations and maintains useful reasoning capabilities. Notably, under strong perturbations, attacks on the robust models take on the appearance of typographic attacks. The approach demonstrates state-of-the-art robustness on 20+ datasets, with strong results on ImageNet-1K and downstream tasks, and is proposed as a practical drop-in replacement for vanilla CLIP and LLaVA, with code and weights released.

Abstract

This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel "double visual defense" to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, $Δ$CLIP and $Δ^2$LLaVA, show substantially enhanced zero-shot robustness and set a new state-of-the-art in adversarial defense for vision-language models. For example, the adversarial robustness of $Δ$CLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, $Δ^2$LLaVA brings a ~30% robustness improvement on image captioning and a ~20% robustness improvement on visual question answering. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is https://doublevisualdefense.github.io/.

Paper Structure

This paper contains 31 sections, 5 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: (a) Our Double Visual Defense framework, which involves an adversarial contrastive pre-training stage and an adversarial visual instruction tuning stage. (b) Comparison of clean performance and robustness of our $\Delta$CLIP model with previous robust and non-robust CLIP models on 4 different tasks: zero-shot recognition, image captioning, visual question answering, and hallucination. Our $\Delta$CLIP attains drastically better robustness while maintaining clean performance close to that of the non-robust OpenAI CLIP counterpart. Note that our $\Delta^2$LLaVA further improves robustness over $\Delta$CLIP on downstream VLM tasks (see the sections on adversarial visual instruction tuning and on experiments). (c) $\Delta^2$LLaVA hallucinates less than LLaVA models built on previous robust CLIP models such as TeCoA (Mao et al., 2023) or FARE (Schlarmann et al., 2024). (d) We observe an intriguing phenomenon: typographic attacks naturally emerge from naive $\ell_{\infty}$-adversarial attacks when applied to our adversarially trained $\Delta^2$LLaVA models. Best viewed when zoomed in.
  • Figure 2: Output from various models under the targeted attacks reported in the LLaVA targeted-attack table. Correct output, erroneous output, and output of successful attacks are marked in green, yellow, and red, respectively. All LLaVA models perform reasonably well on benign input. The non-robust CLIP model is susceptible to adversarial attacks at both radii, $\epsilon=4/255$ and $\epsilon=8/255$. The TeCoA and FARE CLIP models may successfully defend against attacks, but are more likely to produce output that is erroneous or does not accurately correspond to the input. By contrast, our $\Delta^2$LLaVA produces the desired output, close to the output given clean input, even with the large attack radius $\epsilon=8/255$.
  • Figure 3: Visualization of adversarial samples generated with different target models and attack radii. Note that typographic attacks "emerge" from naive $\ell_{\infty}$-adversarial attacks when applied to the proposed robust models, especially with larger attack radii.
  • Figure 4: Visual examples from the POPE hallucination benchmark. GT-Answer is the ground-truth response to the question. A red background indicates hallucination, whereas a green background indicates correct output.
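
The captions for Figures 2 and 3 refer to $\ell_{\infty}$-adversarial attacks with radii $\epsilon=4/255$ and $\epsilon=8/255$. As a rough illustration of what such a budget means, the sketch below runs projected gradient descent (PGD) against a toy logistic model. Everything in it is an assumption made for illustration: the function name `pgd_linf`, the step size, the iteration count, and the linear model itself are hypothetical and are not the paper's attack setup or its CLIP/LLaVA training code.

```python
import numpy as np

def pgd_linf(x, y, w, eps=4 / 255, step=1 / 255, iters=10):
    """PGD under an l_inf budget against a toy model p = sigmoid(w . x).

    x: flattened image with values in [0, 1]; y: label in {0, 1}; w: weights.
    Returns a perturbed copy of x with max |x_adv - x| <= eps.
    (Toy stand-in: the real models are CLIP/LLaVA, not a linear classifier.)
    """
    x_adv = x.copy()
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.dot(w, x_adv)))
        grad = (p - y) * w                         # d(cross-entropy)/d(input)
        x_adv = x_adv + step * np.sign(grad)       # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # stay in the valid pixel range
    return x_adv

rng = np.random.default_rng(0)
x = rng.uniform(size=64)      # hypothetical 64-pixel "image"
w = rng.normal(size=64)       # hypothetical model weights
x_adv = pgd_linf(x, 1.0, w, eps=8 / 255)
print(float(np.max(np.abs(x_adv - x))))  # never exceeds 8/255
```

The two `np.clip` calls carry the idea: every pixel may move by at most $\epsilon$ from its clean value, so a larger radius such as $8/255$ gives the attacker more room than $4/255$.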