
AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models

Yubo Cui, Xianchao Guan, Zijun Xiong, Zheng Zhang

Abstract

Pre-trained vision-language models (VLMs) exhibit strong zero-shot generalization but remain vulnerable to adversarial perturbations. Existing classification-guided adversarial fine-tuning methods often disrupt pre-trained cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance. In this paper, we propose an Alignment-Guided Fine-Tuning (AGFT) framework that enhances zero-shot adversarial robustness while preserving the cross-modal semantic structure. Unlike label-based methods that rely on hard labels and fail to maintain the relative relationships between image and text, AGFT leverages the probabilistic predictions of the original model for text-guided adversarial training, which aligns adversarial visual features with textual embeddings via soft alignment distributions, improving zero-shot adversarial robustness. To address structural discrepancies introduced by fine-tuning, we introduce a distribution consistency calibration mechanism that adjusts the robust model output to match a temperature-scaled version of the pre-trained model predictions. Extensive experiments across multiple zero-shot benchmarks demonstrate that AGFT outperforms state-of-the-art methods while significantly improving zero-shot adversarial robustness.
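The two ideas in the abstract — soft alignment targets from the frozen pre-trained model, and temperature-scaled distribution consistency calibration — can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation: the function name `alignment_loss`, the feature shapes, and the temperature value are all hypothetical, chosen only to show a KL loss between the robust model's image-text similarity distribution and a temperature-calibrated distribution from the frozen model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def alignment_loss(img_feats_robust, img_feats_pretrained, text_feats, tau=2.0):
    """KL divergence between the robust model's image-text alignment
    distribution and a temperature-scaled soft target distribution from
    the frozen pre-trained model (soft targets instead of hard labels).
    Names and temperature are illustrative assumptions."""
    # L2-normalize so dot products are cosine similarities, CLIP-style
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    logits_robust = norm(img_feats_robust) @ norm(text_feats).T      # (N, C)
    logits_frozen = norm(img_feats_pretrained) @ norm(text_feats).T  # (N, C)

    # Temperature-scaled calibration of the pre-trained predictions
    p_target = softmax(logits_frozen / tau)
    q_robust = softmax(logits_robust)

    # KL(p_target || q_robust), averaged over the batch
    return np.mean(np.sum(p_target * (np.log(p_target) - np.log(q_robust)), axis=-1))

# Toy example: 4 (adversarial) image features against 3 class-text embeddings
rng = np.random.default_rng(0)
clean_imgs = rng.normal(size=(4, 8))
texts = rng.normal(size=(3, 8))
adv_imgs = clean_imgs + 0.01 * rng.normal(size=clean_imgs.shape)
loss = alignment_loss(adv_imgs, clean_imgs, texts)
```

In an actual adversarial fine-tuning loop, `adv_imgs` would be features of PGD-style adversarial examples from the trainable image encoder, while `clean_imgs` would come from the frozen pre-trained encoder; the KL term is always non-negative and vanishes only when the two alignment distributions match.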

Paper Structure

This paper contains 18 sections, 16 equations, 4 figures, 15 tables.

Figures (4)

  • Figure 1: The performance of AGFT is shown in Figure 1(a). Figures 1(b) and 1(c) compare classification-guided and alignment-guided adversarial fine-tuning. Unlike classification-guided methods, which rely on label supervision and can disrupt the pre-trained cross-modal alignment, AGFT leverages the probabilistic predictions of the original model to preserve the cross-modal semantic structure, enhancing zero-shot adversarial robustness.
  • Figure 2: The overall pipeline of AGFT. First, we obtain the probabilistic predictions of the pre-trained model and use the resulting distribution as the target for adversarial fine-tuning to encourage adversarial visual features to align with textual embeddings. To mitigate the discrepancies in visual–textual semantic structure, we calibrate the pre-trained output distribution through temperature adjustment, while maintaining the cross-modal similarity structure across images and textual descriptions.
  • Figure 3: Trade-off between robust and clean accuracy across different methods. Each marker type denotes one method, and each point corresponds to a different trade-off configuration.
  • Figure 4: t-SNE visualization of 7 image categories. (a) The original pre-trained CLIP evaluated on clean images. Both (b) AGFT and (c) TeCoA are evaluated on adversarial examples.