Language-Driven Anchors for Zero-Shot Adversarial Robustness
Xiao Li, Wei Zhang, Yining Liu, Zhanhao Hu, Bo Zhang, Xiaolin Hu
TL;DR
This work tackles zero-shot adversarial robustness for deep networks by leveraging vision-language foundation models. It introduces LAAT, a Language-driven Anchor-based Adversarial Training framework that uses fixed, semantically aligned text anchors from a CLIP text encoder, an expansion algorithm to reduce the cosine similarity between anchors, and an Alignment Cross-Entropy objective plus a smoothing loss to train robust image representations. The method demonstrates strong zero-shot and generalized zero-shot robustness across multiple datasets and against strong attacks, often outperforming state-of-the-art few-shot baselines and TeCoA under realistic perturbations. The approach highlights the value of semantic consistency in text anchors for robust zero-shot transfer and points to scalable robustness improvements in large multimodal models, even when labeled data for novel categories are unavailable during training.
Abstract
Deep Neural Networks (DNNs) are known to be susceptible to adversarial attacks. Previous research has mainly focused on improving adversarial robustness in the fully supervised setting, leaving the challenging domain of zero-shot adversarial robustness an open question. In this work, we investigate this domain by leveraging recent advances in large vision-language models, such as CLIP, to introduce zero-shot adversarial robustness to DNNs. We propose LAAT, a Language-driven, Anchor-based Adversarial Training strategy. LAAT uses the text-encoder features of each category as fixed anchors (normalized feature embeddings), which are then employed for adversarial training. By leveraging the semantic consistency of the text encoders, LAAT aims to enhance the adversarial robustness of the image model on novel categories. However, naively using text encoders leads to poor results. Through analysis, we identify the issue as the high cosine similarity between the text anchors they produce. We then design an expansion algorithm and an alignment cross-entropy loss to alleviate the problem. Our experimental results demonstrate that LAAT significantly improves zero-shot adversarial robustness over state-of-the-art methods. LAAT has the potential to enhance adversarial robustness by leveraging large-scale multimodal models, especially when labeled data is unavailable during training.
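
To make the anchor idea in the abstract concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. The toy vectors stand in for CLIP text features, and `center_and_renormalize` is a hypothetical stand-in for the expansion algorithm (it simply subtracts the mean anchor and renormalizes), chosen only to show why high cosine similarity between anchors is a problem and how spreading them apart changes it. Adversarial training and the alignment cross-entropy loss are omitted.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity of two unit-norm vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))

def mean_pairwise_cosine(anchors):
    """Average cosine similarity over all distinct anchor pairs."""
    n = len(anchors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(anchors[i], anchors[j]) for i, j in pairs) / len(pairs)

def center_and_renormalize(anchors):
    """Hypothetical spreading step (an assumption, NOT the paper's expansion
    algorithm): subtract the mean anchor, then renormalize each anchor."""
    dim = len(anchors[0])
    mean = [sum(a[d] for a in anchors) / len(anchors) for d in range(dim)]
    return [normalize([a[d] - mean[d] for d in range(dim)]) for a in anchors]

def predict(image_feature, anchors):
    """Assign the class whose fixed anchor is most similar to the image feature."""
    f = normalize(image_feature)
    scores = [cosine(f, a) for a in anchors]
    return scores.index(max(scores))

# Toy "text features": a large shared component plus a small class-specific one,
# which mimics anchors that all point in nearly the same direction.
raw = [
    [5.0, 1.0, 0.0, 0.0],
    [5.0, 0.0, 1.0, 0.0],
    [5.0, 0.0, 0.0, 1.0],
]
anchors = [normalize(v) for v in raw]          # fixed, normalized anchors
spread = center_and_renormalize(anchors)       # anchors after the stand-in step

before = mean_pairwise_cosine(anchors)
after = mean_pairwise_cosine(spread)
label = predict([0.2, 0.9, 0.1, 0.0], spread)  # toy image feature

print(f"mean pairwise cosine before: {before:.4f}")
print(f"mean pairwise cosine after:  {after:.4f}")
print(f"predicted class: {label}")
```

With these toy vectors the mean pairwise cosine similarity drops from about 0.96 to -0.5. When the anchors are nearly parallel, the similarity scores for different classes differ only slightly, so a small perturbation of the image feature can flip the prediction. Spreading the anchors widens those margins, which is the intuition behind reducing anchor similarity before adversarial training.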
