Revisiting the Robust Generalization of Adversarial Prompt Tuning
Fan Yang, Mingxuan Xia, Sangzhou Xia, Chicheng Ma, Hui Hui
TL;DR
This work tackles the robustness of vision-language foundation models to adversarial examples and the over-fitting seen in adversarial prompt tuning. It introduces CAPT, a multi-modal prompt tuning framework with an adaptive consistency objective that uses a frozen CLIP as a regularizer, formalized as $L_{\rm CAPT} = {\rm CE}({\rm sft}(z^I \cdot z^T/\tau), y) + \lambda L_{\rm adv-cons}$ with $L_{\rm adv-cons} = (1-\alpha_{\rm cons})L_{\rm cons-train} + \alpha_{\rm cons} L_{\rm cons-frz}$ and $L_{\rm cons-frz} = {\rm KL}( {\rm sft}(z^{I_a} \cdot z^T/\tau), {\rm sft}(z_{frz}^I \cdot z_{frz}^T/\tau))$. Through deep, cross-modal prompting and adaptive weighting, CAPT improves robust generalization on adversarial inputs while preserving clean accuracy. This is validated across 14 datasets and multiple data-sparsity regimes, with strong cross-dataset transfer under distribution shifts attributed to the frozen CLIP guidance. Because only prompts are tuned, the approach offers an efficient route to improving the zero-shot robustness of vision-language models on downstream tasks.
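To make the structure of the objective concrete, the following is a minimal pure-Python sketch for a single example, not the authors' implementation. It reads ${\rm sft}$ as softmax and ${\rm KL}(p, q)$ as $\sum_i p_i \log(p_i/q_i)$, both of which are assumptions. The summary does not define $L_{\rm cons-train}$, so the sketch assumes it is the KL divergence between the tuned model's adversarial and clean predictions. It also treats $\alpha_{\rm cons}$ as a fixed scalar, although the method describes the weighting as adaptive. All function names, parameter names, and default values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def class_probs(image_feat, text_feats, tau):
    """sft(z^I . z^T / tau): softmax over image-to-class-text similarities."""
    return softmax([dot(image_feat, t) / tau for t in text_feats])


def cross_entropy(probs, label):
    return -math.log(probs[label])


def kl(p, q, eps=1e-12):
    """KL(p || q); the argument order is an assumption about the notation."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def capt_loss(z_img, z_img_adv, z_txt, z_img_frz, z_txt_frz, label,
              tau=1.0, lam=1.0, alpha_cons=0.5):
    """L_CAPT = CE + lam * ((1 - alpha_cons) * L_cons_train + alpha_cons * L_cons_frz).

    tau, lam and alpha_cons are placeholder values, not the paper's settings.
    """
    p_clean = class_probs(z_img, z_txt, tau)          # tuned model, clean image
    p_adv = class_probs(z_img_adv, z_txt, tau)        # tuned model, adversarial image
    p_frz = class_probs(z_img_frz, z_txt_frz, tau)    # frozen CLIP, clean image

    ce = cross_entropy(p_clean, label)
    l_cons_frz = kl(p_adv, p_frz)
    # ASSUMPTION: L_cons-train is not defined in the summary; here it compares
    # the tuned model's adversarial prediction with its own clean prediction.
    l_cons_train = kl(p_adv, p_clean)
    l_adv_cons = (1 - alpha_cons) * l_cons_train + alpha_cons * l_cons_frz
    return ce + lam * l_adv_cons
```

Under these assumptions, when the adversarial prediction matches both the clean and the frozen predictions, both consistency terms vanish and the loss reduces to the cross-entropy term. Setting `alpha_cons` closer to 1 pulls the adversarial prediction toward the frozen CLIP output rather than toward the tuned model's own clean output.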
Abstract
Understanding the vulnerability of large-scale pre-trained vision-language models like CLIP against adversarial attacks is key to ensuring zero-shot generalization capacity on various downstream tasks. State-of-the-art defense mechanisms generally adopt prompt learning strategies for adversarial fine-tuning to improve the adversarial robustness of the pre-trained model while keeping the efficiency of adapting to downstream tasks. Such a setup leads to over-fitting, which impedes further improvement of the model's generalization capacity on both clean and adversarial examples. In this work, we propose an adaptive Consistency-guided Adversarial Prompt Tuning (i.e., CAPT) framework that utilizes multi-modal prompt learning to enhance the alignment of image and text features for adversarial examples and leverages the strong generalization of pre-trained CLIP to guide the model, enhancing its robust generalization on adversarial examples while maintaining its accuracy on clean ones. We also design a novel adaptive consistency objective function to balance the consistency of adversarial inputs and clean inputs between the fine-tuned model and the pre-trained model. We conduct extensive experiments across 14 datasets and 4 data sparsity schemes (from 1-shot to full training data settings) to show the superiority of CAPT over other state-of-the-art adaptation methods. CAPT demonstrates strong in-distribution performance as well as strong generalization under input distribution shift and across datasets.
