TIMA: Text-Image Mutual Awareness for Balancing Zero-Shot Adversarial Robustness and Generalization Ability

Fengji Ma, Li Liu, Hei Victor Cheng

TL;DR

This work tackles zero-shot adversarial robustness versus zero-shot generalization in CLIP-based foundation models by introducing Text-Image Mutual Awareness (TIMA). TIMA combines two tuning pathways: Image-Aware Text (IAT) with Minimum Hyperspherical Energy (MHE) to enlarge inter-class distances in text embeddings while preserving semantics through cross-modal distillation, and Text-Aware Image (TAI) with a Text-distance based Adaptive Margin (TAM) to enlarge inter-class distances in image embeddings, complemented by Text-Aware Knowledge Distillation (TAKD). The framework jointly optimizes these components with cross-modal supervision to maintain CLIP’s semantic structure, achieving state-of-the-art zero-shot robustness under large adversarial perturbations while preserving zero-shot generalization across diverse datasets. Empirically, TIMA demonstrates superior robust accuracy under PGD-10 and AutoAttack, with improved resilience at larger perturbations and robustness across CLIP temperature settings, along with informative ablations showing the contribution of each tuning mechanism. The results suggest that increasing inter-class distances in both text and image embeddings, together with cross-modal alignment, is a key strategy for robust, generalizable multimodal foundation models in adversarial settings.
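
The PGD-10 evaluation mentioned above can be illustrated with a toy $L_\infty$ projected gradient descent loop. This is a minimal sketch under strong assumptions, not the paper's evaluation code: the "image encoder" is the identity map, logits are raw dot products with hypothetical class embeddings `text_emb`, and the step size `alpha` is an illustrative choice. It shows only the ascent-then-project structure of the attack.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pgd_linf(x, y, text_emb, eps=8 / 255, alpha=2 / 255, steps=10, tau=1.0):
    """Toy L-infinity PGD against a linear classifier (assumed stand-in for CLIP).

    x        : (n, d) inputs with values in [0, 1]
    y        : (n,) integer labels
    text_emb : (k, d) hypothetical class embeddings
    """
    x_adv = x.copy()
    onehot = np.eye(text_emb.shape[0])[y]
    for _ in range(steps):
        probs = softmax(x_adv @ text_emb.T / tau)
        # Gradient of the cross-entropy loss with respect to the input.
        grad = (probs - onehot) @ text_emb / tau
        # Ascent step along the gradient sign.
        x_adv = x_adv + alpha * np.sign(grad)
        # Project back into the eps-ball around x, then into the valid pixel range.
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(4, 16))
text_emb = rng.normal(size=(3, 16))
y = np.array([0, 1, 2, 0])
x_adv = pgd_linf(x, y, text_emb)
print("max perturbation:", np.abs(x_adv - x).max())
```

With `steps=10` the loop corresponds to a 10-step attack; a larger `eps` models the large-perturbation regime the summary refers to.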

Abstract

This work addresses the challenge of achieving zero-shot adversarial robustness while preserving zero-shot generalization in large-scale foundation models, with a focus on the popular Contrastive Language-Image Pre-training (CLIP). Although foundation models have been reported to have exceptional zero-shot generalization, they are highly vulnerable to adversarial perturbations. Existing methods achieve a comparably good tradeoff between zero-shot adversarial robustness and generalization under small adversarial perturbations. However, they fail to achieve a good tradeoff under large adversarial perturbations. To this end, we propose a novel Text-Image Mutual Awareness (TIMA) method that strikes a balance between zero-shot adversarial robustness and generalization. More precisely, we propose an Image-Aware Text (IAT) tuning mechanism that increases the inter-class distance of text embeddings by incorporating the Minimum Hyperspherical Energy (MHE). Simultaneously, fixed pre-trained image embeddings are used as cross-modal auxiliary supervision to maintain the similarity between the MHE-tuned and original text embeddings through knowledge distillation, preserving semantic information between different classes. In addition, we introduce a Text-Aware Image (TAI) tuning mechanism, which increases the inter-class distance between image embeddings during the training stage through a Text-distance based Adaptive Margin (TAM). Similarly, knowledge distillation is used to retain the similarity between fine-tuned and pre-trained image embeddings. Extensive experimental results demonstrate the effectiveness of our approach, showing impressive zero-shot performance against a wide range of adversarial perturbations while preserving the zero-shot generalization capabilities of the original CLIP model.
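
To make the MHE idea concrete, the sketch below computes the standard hyperspherical (Riesz $s$-) energy of a set of class embeddings: each embedding is projected onto the unit sphere, and the energy sums an inverse power of every pairwise distance, so minimizing it pushes embeddings apart. This is an illustrative NumPy version under assumptions; the exact form of $\mathcal{L}_{\mathrm{MHE}}$ used in the paper, the choice of `s`, and the function name are not taken from the source.

```python
import numpy as np

def hyperspherical_energy(embeddings, s=1.0, eps=1e-8):
    """Riesz s-energy of row vectors after projection onto the unit sphere.

    Lower energy means the embeddings are spread more evenly,
    i.e. the inter-class distances are larger.
    """
    w = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Pairwise Euclidean distances between all normalized embeddings.
    dist = np.linalg.norm(w[:, None, :] - w[None, :, :], axis=-1)
    # Keep only the off-diagonal (i != j) entries.
    mask = ~np.eye(len(w), dtype=bool)
    d = np.maximum(dist[mask], eps)
    if s == 0:
        return float(np.sum(np.log(1.0 / d)))
    return float(np.sum(d ** (-s)))

# Hypothetical class embeddings: one well-separated set, one clustered set.
spread = np.eye(3)
clustered = np.array([[1.0, 0.0, 0.0],
                      [0.9, 0.1, 0.0],
                      [0.9, 0.0, 0.1]])
print("spread energy:   ", hyperspherical_energy(spread))
print("clustered energy:", hyperspherical_energy(clustered))
```

The clustered set has the higher energy, which is why a term of this kind, used as a training loss, enlarges inter-class distances. The distillation terms described in the abstract would be needed alongside it to keep the tuned embeddings semantically close to the originals.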

Paper Structure (22 sections, 9 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: The framework of our proposed method, Text-Image Mutual Awareness (TIMA). $\mathcal{L}_{\mathrm{MHE}}$ is the Minimum Hyperspherical Energy loss, $\mathcal{L}_{\mathrm{IAKD}}$ is the Image-Aware Knowledge Distillation loss, $\mathcal{L}_{\mathrm{TAW}}$ is the Text-distance Adaptive Margin loss, and $\mathcal{L}_{\mathrm{TAKD}}$ is the Text-Aware Knowledge Distillation loss.
  • Figure 2: Cosine similarity matrices of text-text embedding pairs and (clean) image-text embedding pairs on CIFAR10. The first row shows the similarity of text-text embedding pairs for different methods. The second row shows the similarity of clean image-text embedding pairs for different methods.
  • Figure 3: Cosine similarity matrix of adversarial image embedding pairs on CIFAR10 under an adversarial attack with a large perturbation radius ($\varepsilon=8/255$).
  • Figure 4: Zero-shot robust/clean accuracy of the proposed method under the CLIP temperature $\tau = 1$.
  • Figure 5: Zero-shot robust accuracy of our proposed method for different values of the hyper-parameters $m$ and $\eta$ in the proposed Text-distance Adaptive Margin.
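
The cosine similarity matrices of Figures 2 and 3 and the temperature $\tau$ of Figure 4 can be sketched as follows. This is a minimal NumPy illustration on random stand-in embeddings, assuming the common CLIP-style convention that zero-shot logits are cosine similarities divided by $\tau$; the function names and data are hypothetical and no real CLIP model is loaded.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def zero_shot_probs(image_emb, text_emb, tau=1.0):
    """Class probabilities from temperature-scaled cosine similarities."""
    logits = cosine_similarity_matrix(image_emb, text_emb) / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(10, 32))   # stand-in for 10 class text embeddings
image_emb = rng.normal(size=(5, 32))   # stand-in for 5 image embeddings

text_text = cosine_similarity_matrix(text_emb, text_emb)    # as in Figure 2, row 1
image_text = cosine_similarity_matrix(image_emb, text_emb)  # as in Figure 2, row 2
probs_tau_1 = zero_shot_probs(image_emb, text_emb, tau=1.0)
probs_tau_small = zero_shot_probs(image_emb, text_emb, tau=0.01)
print(text_text.shape, image_text.shape)
print(probs_tau_1.max(axis=1), probs_tau_small.max(axis=1))
```

Because cosine similarities lie in $[-1, 1]$, $\tau = 1$ yields nearly flat probabilities, while a small $\tau$ sharpens them. Smaller off-diagonal entries in the text-text matrix correspond to larger inter-class distances.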