Language-Driven Anchors for Zero-Shot Adversarial Robustness

Xiao Li, Wei Zhang, Yining Liu, Zhanhao Hu, Bo Zhang, Xiaolin Hu

TL;DR

This work tackles zero-shot adversarial robustness for deep networks by leveraging vision-language foundation models such as CLIP. It introduces LAAT, a Language-driven, Anchor-based Adversarial Training framework with three components: fixed, semantically aligned text anchors produced by a CLIP text encoder; an expansion algorithm that reduces the cosine similarity between anchors; and an Alignment Cross-Entropy objective, combined with a smoothing loss, for training robust image representations. The method shows strong zero-shot and generalized zero-shot robustness across multiple datasets and against strong attacks, often outperforming state-of-the-art few-shot baselines and TeCoA. The results highlight the value of semantic consistency among text anchors for robust zero-shot transfer, and they suggest that large multimodal models can improve robustness even when labeled data for novel categories is unavailable during training.

Abstract

Deep Neural Networks (DNNs) are known to be susceptible to adversarial attacks. Previous research has mainly focused on improving adversarial robustness in the fully supervised setting, leaving the challenging domain of zero-shot adversarial robustness an open question. In this work, we investigate this domain by leveraging recent advances in large vision-language models, such as CLIP, to introduce zero-shot adversarial robustness to DNNs. We propose LAAT, a Language-driven, Anchor-based Adversarial Training strategy. LAAT uses the text-encoder features of each category as fixed anchors (normalized feature embeddings), which are then employed for adversarial training. By leveraging the semantic consistency of the text encoder, LAAT aims to enhance the adversarial robustness of the image model on novel categories. However, naively using the text encoder's outputs leads to poor results. Through analysis, we identified the cause as the high cosine similarity between the anchors produced by the text encoder. We then designed an expansion algorithm and an alignment cross-entropy loss to alleviate the problem. Our experimental results demonstrate that LAAT significantly improves zero-shot adversarial robustness over state-of-the-art methods. LAAT has the potential to enhance adversarial robustness by leveraging large-scale multimodal models, especially when labeled data is unavailable during training.
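
The anchor-based classification described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random vectors below are hypothetical stand-ins for CLIP text-encoder outputs, and the helper names (`l2_normalize`, `cosine_logits`, `mean_pairwise_cosine`) are assumptions introduced only for illustration. The sketch shows two things: classifying an image feature by its cosine similarity to fixed, normalized anchors, and measuring how similar the anchors are to one another, which is the quantity the abstract identifies as problematic when it is high.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def cosine_logits(image_features, anchors):
    """Cosine similarity between each image feature and each fixed anchor.

    image_features: (n, d) array, anchors: (k, d) array -> (n, k) array.
    """
    return l2_normalize(image_features) @ l2_normalize(anchors).T


def mean_pairwise_cosine(anchors):
    """Average cosine similarity over all pairs of distinct anchors."""
    a = l2_normalize(anchors)
    sim = a @ a.T
    off_diagonal = sim[~np.eye(sim.shape[0], dtype=bool)]
    return float(off_diagonal.mean())


rng = np.random.default_rng(0)

# Hypothetical anchors: a shared component plus small per-category noise,
# chosen so that the anchors are deliberately close to one another.
shared = rng.normal(size=(1, 16))
anchors = shared + 0.3 * rng.normal(size=(4, 16))

# Hypothetical image features: scaled copies of anchors 2 and 0.
# Cosine similarity ignores the scale, so each maps back to its own anchor.
image_features = 3.0 * anchors[[2, 0]]

predictions = cosine_logits(image_features, anchors).argmax(axis=1)
anchor_similarity = mean_pairwise_cosine(anchors)

print("predicted categories:", predictions)
print("mean pairwise anchor cosine similarity:", anchor_similarity)
```

Because the anchors are fixed, a novel category can be handled at inference time by appending its anchor as a new row of `anchors`, without retraining the image encoder. The diagnostic `mean_pairwise_cosine` only measures anchor similarity; it does not implement the paper's expansion algorithm.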

Paper Structure

This paper contains 35 sections, 8 equations, 5 figures, 11 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of the zero-shot ability of LAAT. Marker colors indicate categories. When a model is adversarially trained with the text anchors of table and lion (seen categories), it can recognize adversarial examples of these two categories (grey and green). Because the text anchors of chair and tiger (novel categories) lie close to those of table and lion, respectively, the model can also recognize the two novel categories.
  • Figure 2: The pipeline of LAAT. Only the image encoder is trainable. $\odot$ denotes computing the cosine similarity. Red arrows indicate the adversarial example generation process and brown arrows indicate inference.
  • Figure 3: Learning curves of AT supervised by $\cos \theta$ with fixed anchors generated by the MMC method, the CLIP text encoder, and the expansion algorithm (see the section on the expansion algorithm).
  • Figure 4: An illustration of the expansion operation in 3D space.
  • Figure 5: Classification accuracy on both benign and adversarial examples in 5-way few-shot setting on CIFAR-FS, with image feature anchors or with image-text blended anchors. The dashed line denotes 5-way zero-shot accuracy.