Table of Contents
Fetching ...

Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models

Lu Yu, Haiyang Zhang, Changsheng Xu

TL;DR

Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR) is proposed, which combines the text-guided attention obtained from the target model via adversarial examples with the text-guided attention acquired from the original model via clean examples to enhance the model's robustness.

Abstract

Due to the impressive zero-shot capabilities, pre-trained vision-language models (e.g. CLIP), have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: the Attention Refinement module and the Attention-based Model Constraint module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness: The Attention Refinement module aligns the text-guided attention obtained from the target model via adversarial examples with the text-guided attention acquired from the original model via clean examples. This alignment enhances the model's robustness. Additionally, the Attention-based Model Constraint module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. The experiments validate that our method yields a 9.58% enhancement in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets. Our code is available at https://github.com/zhyblue424/TGA-ZSR.

Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models

TL;DR

Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR) is proposed, which combines the text-guided attention obtained from the target model via adversarial examples with the text-guided attention acquired from the original model via clean examples to enhance the model's robustness.

Abstract

Due to the impressive zero-shot capabilities, pre-trained vision-language models (e.g. CLIP), have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: the Attention Refinement module and the Attention-based Model Constraint module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness: The Attention Refinement module aligns the text-guided attention obtained from the target model via adversarial examples with the text-guided attention acquired from the original model via clean examples. This alignment enhances the model's robustness. Additionally, the Attention-based Model Constraint module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. The experiments validate that our method yields a 9.58% enhancement in zero-shot robust accuracy over the current state-of-the-art techniques across 16 datasets. Our code is available at https://github.com/zhyblue424/TGA-ZSR.

Paper Structure

This paper contains 16 sections, 8 equations, 4 figures, 20 tables.

Figures (4)

  • Figure 1: The four rows depict the original image, its associated attention map, the generated adversarial example, and the attention map of the adversarial example. Labels in black indicate the ground truth, while those in red represent mis-classified labels for the adversarial examples.
  • Figure 2: An overview of our TGA-ZSR framework: We generate adversarial examples and feed them into the target image encoder. To enhance the adversarial robustness of the CLIP model and maintain its generalization, we introduce text-guided attention. This involves refining the framework for adversarial examples through the Attention Refinement module and constraining the model to prevent significant drift via the Attention-based Model Constraint module.
  • Figure 3: Ablation study on each component. After adversarial fine-tuning the model using adversarial examples generated by PGD-2, we verify the robustness of the model using adversarial examples generated by PGD-100.
  • Figure 4: The trade-off between robustness and clean accuracy. Each point on the graph represents a method, with the size of the point indicating the extent to which it achieves a favorable trade-off between robustness and clean accuracy.