What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models

Sen Nie; Jie Zhang; Zhongqi Wang; Zhaoyang Wei; Shiguang Shan; Xilin Chen

Abstract

Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness. Our project page is available at https://summu77.github.io/R-Adapt.

Paper Structure (19 sections, 8 equations, 14 figures, 11 tables)

Introduction
Preliminaries and Related Work
Insights over Adversarially Fine-tuned VLMs
Shallow Layers as the Primary Driver of Adversarial Robustness
Two-Stage Robustness in Early Layers: Low-Pass Filtering and Input-Insensitive Attention Pattern
Discussions on the Robustness-Accuracy Trade-off
Methodology
R-Adapt: Towards Reconciling Robustness and Accuracy
Acquisition of the Robustness Anchor
Experiments
Setup
Main results
Ablation Study
Conclusion
Additional Results on Progressive Replacement
...and 4 more sections

Figures (14)

Figure 1: Drawing upon the insights from adversarially fine-tuned models, we propose the R-Adapt framework. Notably, compared to the standard AFT baseline, FARE (Schlarmann et al., 2024), R-Adapt$^{+}$ requires $\downarrow\mathbf{640\times}$ fewer training images, yet delivers gains in both clean accuracy ($\uparrow$10.8%) and adversarial robustness ($\uparrow$4.4%), averaged across 16 classification benchmarks.
Figure 2: Layer-wise Analysis. (a) CKA analysis reveals a clear representational gap at the initial stage, indicating a fundamental functional shift within the shallow layers. (b) Progressive replacement demonstrates an early saturation of adversarial robustness, emphasizing the critical roles of the Embedding layer (denoted as $\text{Emb.}$) and the first Attention block (denoted as $\text{B1-Attn}$) in driving the model's overall robustness.
Figure 3: Visualizations. (a) The Spectral Shift Map ($\Delta \mathcal{S}$) of the embedding layer demonstrates that the FARE model amplifies low-frequency signals while suppressing high-frequency components. (b) The attention maps of the first block show that the FARE model develops an input-insensitive mechanism by redirecting attention to consistent non-semantic regions (e.g., the blank backgrounds, particularly evident in the second row).
Figure 4: We visualize the performance trends during the cumulative substitution of standard CLIP modules with weights from a model adversarially fine-tuned on ImageNet (FARE).
Figure 5: Illustration of three Robustness Anchor acquisition paradigms. (a) Training-Free: direct extraction from the standard CLIP model using a uniform white input; (b) Model-Guided: extraction from an adversarially fine-tuned model $\mathcal{M}$; (c) Data-Driven (R-Adapt$^+$): optimization of the anchor using a limited amount of adversarial data.
...and 9 more figures
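
The CKA comparison mentioned in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the linear-kernel variant of CKA, which the text above does not specify, and it uses random matrices as stand-ins for layer activations. The function name `linear_cka` and the variable names are hypothetical and are not taken from the paper.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features)."""
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of samples")
    # Center every feature over the sample axis.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Assumption: random features stand in for activations of one layer on 256 inputs.
rng = np.random.default_rng(0)
feats_standard = rng.normal(size=(256, 64))

# Same representation up to an orthogonal rotation and a global scale.
rotation, _ = np.linalg.qr(rng.normal(size=(64, 64)))
feats_rotated = 3.0 * feats_standard @ rotation

# Independent features, i.e. an unrelated representation.
feats_unrelated = rng.normal(size=(256, 64))

cka_self = linear_cka(feats_standard, feats_standard)
cka_rotated = linear_cka(feats_standard, feats_rotated)
cka_unrelated = linear_cka(feats_standard, feats_unrelated)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so `cka_self` and `cka_rotated` are both 1 up to floating-point error, while `cka_unrelated` is clearly lower. In a layer-wise analysis of the kind the caption describes, a low score between the same layer of two models would indicate a representational gap at that depth.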
