SAGE: Spuriousness-Aware Guided Prompt Exploration for Mitigating Multimodal Bias

Wenqian Ye, Di Wang, Guangtao Zheng, Bohan Liu, Aidong Zhang

TL;DR

This paper tackles multimodal spurious bias in zero-shot vision-language models like CLIP, where background or contextual cues can mislead object-level predictions. It introduces Spuriousness-Aware Guided Exploration (SAGE), a training-free method that searches a diverse set of prompt templates and selects those that maximize inter-class separation in the joint image-text space, thereby reducing reliance on spurious features. The authors provide a theoretical analysis linking higher class-separation in prompt-induced embeddings to robustness, and validate SAGE across four real-world benchmarks and five backbone models, showing improvements in zero-shot accuracy and worst-group robustness without any data annotations or model updates. Overall, SAGE offers a practical out-of-the-box debiasing approach for CLIP-like systems with strong generalization, balancing accuracy and fairness in zero-shot inference.

Abstract

Large vision-language models, such as CLIP, have shown strong zero-shot classification performance by aligning images and text in a shared embedding space. However, CLIP models often develop multimodal spurious bias, the undesirable tendency to rely on spurious features. For example, CLIP may infer object types in images based on frequently co-occurring backgrounds rather than the object's core features. This bias significantly impairs the robustness of pre-trained CLIP models on out-of-distribution data, where such cross-modal associations no longer hold. Existing methods for mitigating multimodal spurious bias typically require fine-tuning on downstream data or prior knowledge of the bias, which undermines the out-of-the-box usability of CLIP. In this paper, we first theoretically analyze the impact of multimodal spurious bias in zero-shot classification. Based on this insight, we propose Spuriousness-Aware Guided Exploration (SAGE), a simple and effective method that mitigates spurious bias through guided prompt selection. SAGE requires no training, fine-tuning, or external annotations. It explores a space of prompt templates and selects the prompts that induce the largest semantic separation between classes, thereby improving worst-group robustness. Extensive experiments on four real-world benchmark datasets and five popular backbone models demonstrate that SAGE consistently improves zero-shot performance and generalization, outperforming previous zero-shot approaches without any external knowledge or model updates.

Paper Structure

This paper contains 22 sections, 1 theorem, 13 equations, 8 figures, 4 tables.

Key Result

Theorem 1

Consider a pre-trained CLIP model from which we obtain two text representations $\mathbf{u}_1$, $\mathbf{u}_2$ for the classes $c_1$ and $c_2$ respectively, an image representation $\mathbf{v}$ with the class label $c_2$, and a textual spurious feature $\mathbf{u}_s$ related to $\mathbf{v}$. Assume

Figures (8)

  • Figure 1: Prompts with greater separation between class similarity scores (e.g., prompt_k) yield robust zero-shot performance under spurious correlations, whereas those with smaller score differences (e.g., prompt_1) tend to yield poorer discrimination and lower worst-group performance.
  • Figure 2: Method overview. (a) Illustration of multimodal spurious bias, where $c_2$ denotes a class label, $\mathbf{v}$ denotes an image representation, $\mathbf{u}_s$ denotes a textual spurious feature, and $\mathbf{u}_1$ and $\mathbf{u}_2$ denote text representations for the classes $c_1$ and $c_2$, respectively. (b) For each test image, we evaluate $M$ prompt templates and compute a separation score that measures how well each prompt distinguishes between classes in the joint image-text space. The top-$K$ templates with the highest scores are selected. (c) Zero-shot classification is then performed by ensembling predictions from the $K$ class-discriminative prompts selected for that image.
  • Figure 3: Pearson correlation analysis of separation scores and worst-group accuracy (WGA) on CelebA across five backbone models. Each scatter plot shows the relationship between the score assigned to a candidate template and its corresponding WGA in zero-shot inference. The consistent positive correlation observed across all models indicates that templates with higher separation scores tend to yield better worst-group performance, validating the effectiveness of our scoring method for robust template selection.
  • Figure 4: Ablation study on the effect of varying the number of selected prompts across different models with our proposed method.
  • Figure 5: Most frequently selected prompt templates for each class by our method with CLIP-ViT-B/32 in the Waterbirds dataset.
  • ...and 3 more figures
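
The per-image procedure described in the Figure 2 caption (score $M$ templates, keep the top-$K$, ensemble their predictions) can be illustrated with a minimal sketch. This is not the authors' implementation. The summary does not give the exact separation score, so the sketch assumes it is the gap between the highest and second-highest class similarity for a template, and it assumes the ensemble is a plain average of similarities over the selected templates. The function names (`separation_score`, `select_and_classify`) and the toy embeddings are hypothetical; in practice the embeddings would come from a CLIP image and text encoder.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def separation_score(class_sims):
    # ASSUMPTION: separation is the gap between the best and
    # second-best class similarity under one prompt template.
    ranked = sorted(class_sims, reverse=True)
    return ranked[0] - ranked[1]


def select_and_classify(image_emb, text_embs, k):
    """text_embs[m][c] is the text embedding of template m filled with class c.

    Returns (predicted class index, indices of the top-k templates, scores).
    """
    # Image-text similarity for every (template, class) pair.
    sims = [[cosine(image_emb, u) for u in per_class] for per_class in text_embs]
    # One separation score per template, computed for this image only.
    scores = [separation_score(row) for row in sims]
    # Keep the k templates with the largest separation.
    top_k = sorted(range(len(sims)), key=lambda m: scores[m], reverse=True)[:k]
    # ASSUMPTION: ensemble by averaging similarities over selected templates.
    num_classes = len(sims[0])
    ensembled = [
        sum(sims[m][c] for m in top_k) / len(top_k) for c in range(num_classes)
    ]
    predicted = max(range(num_classes), key=lambda c: ensembled[c])
    return predicted, top_k, scores


# Toy example: 3 templates, 2 classes, 2-dimensional embeddings.
image_emb = [1.0, 0.0]
text_embs = [
    [[1.0, 1.0], [1.0, 0.9]],  # template 0: classes barely separated
    [[0.0, 1.0], [1.0, 0.0]],  # template 1: classes clearly separated
    [[0.5, 1.0], [1.0, 0.2]],  # template 2: moderately separated
]
predicted, top_k, scores = select_and_classify(image_emb, text_embs, k=2)
print(predicted, top_k)
```

In this toy setup, template 0 assigns nearly equal similarity to both classes and is therefore dropped, while the two better-separating templates drive the prediction. Because selection happens per image, different test images may end up with different template subsets.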

Theorems & Definitions (3)

  • Definition 1: Multimodal spurious bias
  • Theorem 1
  • proof