Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting
Nikoo Naghavian, Mostafa Tavassolipour
TL;DR
Problem: vision-language models such as CLIP lose zero-shot robustness under imperceptible adversarial perturbations. Approach: CAW combines a Confidence-Aware loss $L_{CA}$, which scales the KL divergence between the adversarial and clean predictions $P^{\text{adv}}$ and $P^{\text{clean}}$, with a regularization term $L_{Reg}$ that aligns adversarial image features from the fine-tuned and frozen encoders; training minimizes $L_{total}=L_{CE}+\alpha L_{CA}+\beta L_{Reg}$. Contributions: (1) improves robustness against AutoAttack on TinyImageNet and 14 zero-shot datasets; (2) outperforms PMG-AFT and TGA-ZSR under PGD-100 and CW while using less memory; (3) generalizes well across diverse datasets. Significance: supports safer deployment of vision-language models by achieving higher robust and clean accuracy with a lower memory footprint.
Abstract
Vision-language models like CLIP demonstrate impressive zero-shot generalization but remain highly vulnerable to adversarial attacks. In this work, we propose Confidence-Aware Weighting (CAW) to enhance zero-shot robustness in vision-language models. CAW consists of two components: (1) a Confidence-Aware loss that prioritizes uncertain adversarial examples by scaling the KL divergence between clean and adversarial predictions, and (2) a feature alignment regularization that preserves semantic consistency by minimizing the distance between frozen and fine-tuned image encoder features on adversarial inputs. These components work jointly to improve both clean and robust accuracy without sacrificing generalization. Extensive experiments on TinyImageNet and 14 additional datasets show that CAW outperforms recent methods such as PMG-AFT and TGA-ZSR under strong attacks like AutoAttack, while using less memory.
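To make the structure of the objective concrete, the following is a minimal sketch of how the three terms could be combined as $L_{CE}+\alpha L_{CA}+\beta L_{Reg}$. Several details are assumptions made for illustration and are not stated in the text above: the confidence weight is taken to be one minus the adversarial probability of the true class, the KL divergence is computed from the clean to the adversarial distribution, the feature distance is Euclidean, and the function name `caw_total_loss` and its arguments are hypothetical. Plain Python lists stand in for tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def caw_total_loss(adv_logits, clean_logits, labels,
                   feat_tuned, feat_frozen, alpha=1.0, beta=1.0):
    """Illustrative L_CE + alpha * L_CA + beta * L_Reg, averaged over a batch.

    ASSUMPTIONS (not taken from the paper): the weight form, the KL
    direction, and the Euclidean feature distance.
    """
    n = len(labels)
    l_ce = l_ca = l_reg = 0.0
    for i in range(n):
        p_adv = softmax(adv_logits[i])
        p_clean = softmax(clean_logits[i])
        y = labels[i]

        # Cross-entropy on the adversarial prediction.
        l_ce += -math.log(p_adv[y] + 1e-12)

        # Assumed confidence weight: larger when the model is less
        # confident in the true class on the adversarial input.
        weight = 1.0 - p_adv[y]
        l_ca += weight * kl_divergence(p_clean, p_adv)

        # Distance between fine-tuned and frozen encoder features,
        # both computed on the adversarial image.
        l_reg += math.sqrt(sum((a - b) ** 2
                               for a, b in zip(feat_tuned[i], feat_frozen[i])))

    l_ce, l_ca, l_reg = l_ce / n, l_ca / n, l_reg / n
    total = l_ce + alpha * l_ca + beta * l_reg
    return {"total": total, "ce": l_ce, "ca": l_ca, "reg": l_reg}


if __name__ == "__main__":
    out = caw_total_loss(
        adv_logits=[[0.0, 2.0, 0.0]],
        clean_logits=[[3.0, 0.0, 0.0]],
        labels=[0],
        feat_tuned=[[1.0, 0.0]],
        feat_frozen=[[0.0, 0.0]],
        alpha=1.0,
        beta=1.0,
    )
    print(out)
```

Under these assumptions, an example whose adversarial prediction matches its clean prediction and whose features are unchanged contributes only the cross-entropy term, while examples that are both uncertain and shifted by the attack contribute most to $L_{CA}$.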
