Revisiting the Robust Generalization of Adversarial Prompt Tuning

Fan Yang, Mingxuan Xia, Sangzhou Xia, Chicheng Ma, Hui Hui

TL;DR

This work addresses the vulnerability of vision-language foundation models such as CLIP to adversarial attacks, and the over-fitting that arises in adversarial prompt tuning. It introduces CAPT, a multi-modal prompt tuning framework with an adaptive consistency objective that uses a frozen pre-trained CLIP as a regularizer. The objective is formalized as $L_{\rm CAPT} = {\rm CE}({\rm sft}(z^I \cdot z^T/\tau), y) + \lambda L_{\rm adv-cons}$, with $L_{\rm adv-cons} = (1-\alpha_{\rm cons})L_{\rm cons-train} + \alpha_{\rm cons} L_{\rm cons-frz}$ and $L_{\rm cons-frz} = {\rm KL}( {\rm sft}(z^{I_a} \cdot z^T/\tau), {\rm sft}(z_{frz}^I \cdot z_{frz}^T/\tau))$. Through deep, cross-modal prompting and adaptive weighting of the two consistency terms, CAPT improves robust generalization on adversarial inputs while preserving clean accuracy. The method is validated across 14 datasets and multiple data-sparsity regimes, and the frozen CLIP guidance supports cross-dataset transfer and generalization under distribution shift. The results suggest that CAPT is an efficient route to improving zero-shot adversarial robustness when adapting vision-language models to downstream tasks.
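The objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions: ${\rm sft}$ is read as softmax; ${\rm KL}(p, q)$ is read as $\sum p \log(p/q)$ with the first argument as $p$; $L_{\rm cons-train}$, which the summary does not define, is taken here to be the KL divergence between the tuned model's adversarial and clean predictions; and $\alpha_{\rm cons}$ is passed in as a fixed number rather than adapted. All function names and default values are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def similarity_logits(image_feats, text_feats, tau):
    # z^I . z^T / tau, giving one row of class logits per image.
    return image_feats @ text_feats.T / tau

def cross_entropy(logits, labels):
    logp = log_softmax(logits)
    return -logp[np.arange(len(labels)), labels].mean()

def kl_divergence(logits_p, logits_q):
    # KL(softmax(logits_p) || softmax(logits_q)), averaged over the batch.
    # ASSUMPTION: the first argument plays the role of p.
    logp, logq = log_softmax(logits_p), log_softmax(logits_q)
    return (np.exp(logp) * (logp - logq)).sum(axis=-1).mean()

def capt_loss(z_img, z_img_adv, z_txt, z_img_frz, z_txt_frz, labels,
              tau=0.07, lam=1.0, alpha_cons=0.5):
    # tau, lam and alpha_cons are placeholder values, not taken from the paper.
    logits = similarity_logits(z_img, z_txt, tau)
    logits_adv = similarity_logits(z_img_adv, z_txt, tau)
    logits_frz = similarity_logits(z_img_frz, z_txt_frz, tau)

    ce = cross_entropy(logits, labels)
    # ASSUMPTION: L_cons-train compares adversarial and clean predictions
    # of the model being tuned; the summary does not spell this term out.
    cons_train = kl_divergence(logits_adv, logits)
    # L_cons-frz: adversarial predictions vs. the frozen CLIP predictions.
    cons_frz = kl_divergence(logits_adv, logits_frz)

    adv_cons = (1.0 - alpha_cons) * cons_train + alpha_cons * cons_frz
    return ce + lam * adv_cons
```

When the adversarial, clean, and frozen features coincide, both consistency terms vanish and the loss reduces to the cross-entropy term, which is a quick sanity check on the weighting.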

Abstract

Understanding the vulnerability of large-scale pre-trained vision-language models like CLIP to adversarial attacks is key to ensuring their zero-shot generalization capacity on various downstream tasks. State-of-the-art defense mechanisms generally adopt prompt learning strategies for adversarial fine-tuning, improving the adversarial robustness of the pre-trained model while keeping adaptation to downstream tasks efficient. Such a setup leads to over-fitting, which impedes further improvement of the model's generalization capacity on both clean and adversarial examples. In this work, we propose an adaptive Consistency-guided Adversarial Prompt Tuning (CAPT) framework that uses multi-modal prompt learning to enhance the alignment of image and text features for adversarial examples, and leverages the strong generalization of pre-trained CLIP to guide the model, enhancing its robust generalization on adversarial examples while maintaining its accuracy on clean ones. We also design a novel adaptive consistency objective function to balance the consistency of adversarial inputs and clean inputs between the fine-tuned model and the pre-trained model. We conduct extensive experiments across 14 datasets and 4 data sparsity schemes (from 1-shot to full training data settings) to show the superiority of CAPT over other state-of-the-art adaptation methods. CAPT demonstrates excellent in-distribution performance as well as strong generalization under input distribution shift and across datasets.

Paper Structure

This paper contains 11 sections, 11 equations, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: Overview of our adaptive Consistency-guided Adversarial Prompt Tuning framework. Our method adopts multi-modal prompt learning to improve the alignment between visual and textual features for adversarial examples and to enhance the robustness of the image and text encoders during training. We introduce a frozen pre-trained CLIP to tackle the over-fitting issue of adversarial fine-tuning and improve zero-shot adversarial robustness.