Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness
Zeyu Wang, Cihang Xie, Brian Bartoldson, Bhavya Kailkhura
TL;DR
The paper tackles vision-language model robustness to adversarial visual perturbations by proposing Double Visual Defense, a two-stage training strategy that first adversarially pre-trains CLIP from scratch on web-scale data to produce ΔCLIP, and then applies adversarial visual instruction tuning to LLaVA to yield Δ^2LLaVA. This end-to-end defense surpasses prior post-hoc approaches, delivering substantial improvements in adversarial robustness across zero-shot recognition, image captioning, and visual question answering while preserving open-set generalization. The second layer of robustness, obtained through adversarial autoregressive training, reduces hallucinations and maintains useful reasoning capabilities, with a notable emergence of typographic attacks under strong perturbations. The approach demonstrates state-of-the-art robustness on 20+ datasets, with strong results on ImageNet-1k and downstream tasks, and is proposed as a practical drop-in replacement for vanilla CLIP and LLaVA, accompanied by a release of code and weights.
Abstract
This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel ``double visual defense'' to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, $Δ$CLIP and $Δ^2$LLaVA, show substantially enhanced zero-shot robustness and set a new state-of-the-art in adversarial defense for vision-language models. For example, the adversarial robustness of $Δ$CLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, $Δ^2$LLaVA brings a ~30% robustness improvement on the image captioning task and a ~20% robustness improvement on the visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is https://doublevisualdefense.github.io/.
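To make "adversarial visual perturbation" concrete, the following is a minimal sketch of an L-infinity projected gradient descent (PGD) attack against a toy zero-shot classifier. It is not the training code of this paper: the linear image encoder `W`, the text-embedding matrix `T`, the helper names, and the values of `eps`, `alpha`, and `steps` are all illustrative assumptions. Adversarial training in general minimizes the loss on inputs perturbed this way rather than on clean inputs.

```python
import numpy as np

def loss_and_grad(x, W, T, label):
    """Cross-entropy of toy zero-shot logits T @ (W @ x), and its gradient w.r.t. x.

    W (hypothetical): linear stand-in for an image encoder, shape (d, n_pixels).
    T (hypothetical): one text embedding per class, shape (n_classes, d).
    """
    logits = T @ (W @ x)
    shifted = logits - logits.max()            # stabilize the exponentials
    probs = np.exp(shifted) / np.exp(shifted).sum()
    loss = np.log(np.exp(shifted).sum()) - shifted[label]
    onehot = np.zeros_like(probs)
    onehot[label] = 1.0
    grad = W.T @ (T.T @ (probs - onehot))      # chain rule through the linear maps
    return loss, grad

def pgd_linf(x, W, T, label, eps=4 / 255, alpha=1 / 255, steps=10):
    """Untargeted L-infinity PGD: ascend the loss, stay within eps of x and in [0, 1]."""
    x_adv = x.copy()
    for _ in range(steps):
        _, grad = loss_and_grad(x_adv, W, T, label)
        x_adv = x_adv + alpha * np.sign(grad)      # signed gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the eps-ball around x
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep a valid pixel range
    return x_adv

# Usage on random toy data (assumed shapes: 16 "pixels", 8-dim embeddings, 3 classes).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
T = rng.normal(size=(3, 8))
x = rng.uniform(size=16)
x_adv = pgd_linf(x, W, T, label=0)
```

In this toy setting the loss is convex in the input, so each projected step cannot decrease it; with a real deep encoder no such guarantee holds, and the gradient would come from automatic differentiation instead of the closed form used here.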
