Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition

Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang

TL;DR

This work tackles object-context shortcut biases that undermine zero-shot reliability in vision-language models by recasting the problem as causal inference. It proposes a lightweight, inference-only framework that operates in CLIP's representation space to estimate counterfactual object-context embeddings and compute a Total Direct Effect (TDE) to suppress background hallucinations while preserving beneficial interactions. By constructing counterfactuals from external scenes, batch neighbors, and text-derived descriptions, and by blending base and counterfactual TDEs, the method achieves state-of-the-art zero-shot performance on several context-sensitive benchmarks with minimal overhead and no retraining. The approach offers a practical, scalable causal pathway to debiased and reliable multimodal reasoning in open-world settings.

Abstract

Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
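The calibration idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the object and context embeddings are already available, that an object-context pair can be composed by simple addition, and that the base and counterfactual effects are blended with a fixed weight `alpha`. The function names, the toy 4-dimensional embeddings, and `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(v, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

def class_scores(image_emb, text_embs):
    """Cosine similarity between one image embedding and each class text embedding."""
    return l2_normalize(text_embs) @ l2_normalize(image_emb)

def debiased_scores(obj_emb, ctx_emb, alt_ctx_embs, text_embs, alpha=0.5):
    """Blend a base and a counterfactual direct-effect estimate (illustrative only).

    Assumption: an object-context pair is composed by adding the two embeddings.
    """
    # Base branch: score the observed pair, then subtract the context-only score.
    base_tde = class_scores(obj_emb + ctx_emb, text_embs) - class_scores(ctx_emb, text_embs)
    # Counterfactual branch: pair the same object with alternative contexts,
    # subtract each context-only score, and average over contexts.
    cf_tde = np.mean(
        [class_scores(obj_emb + c, text_embs) - class_scores(c, text_embs)
         for c in alt_ctx_embs],
        axis=0,
    )
    return alpha * base_tde + (1.0 - alpha) * cf_tde

# Toy axes (hypothetical): [waterbird object, landbird object, water scene, land scene].
text_embs = np.array([[1.0, 0.0, 1.0, 0.0],   # class 0: "waterbird", entangled with water
                      [0.0, 1.0, 0.0, 1.0]])  # class 1: "landbird", entangled with land
obj_emb = np.array([0.0, 1.0, 0.0, 0.0])      # a landbird ...
ctx_emb = np.array([0.0, 0.0, 1.5, 0.0])      # ... photographed over water
alt_ctx_embs = np.array([[0.0, 0.0, 0.0, 1.5],   # land scene
                         [0.0, 0.0, 1.0, 1.0]])  # mixed scene

biased = class_scores(obj_emb + ctx_emb, text_embs)
debiased = debiased_scores(obj_emb, ctx_emb, alt_ctx_embs, text_embs)
print("raw prediction:", int(np.argmax(biased)))
print("calibrated prediction:", int(np.argmax(debiased)))
```

In this toy setup the strong water background pulls the raw score toward the waterbird class, while subtracting the context-only score and averaging over alternative contexts moves the prediction back to the landbird class. How the paper actually estimates the object and background expectations, and how it weights the two branches, is not specified in this summary.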

Paper Structure

This paper contains 34 sections, 48 equations, 11 figures, 17 tables, 1 algorithm.

Figures (11)

  • Figure 1: Bias and hallucinations caused by co-occurrence in CLIP. (a) Accuracy drops significantly from the best to the worst group across backbones on the Waterbirds dataset, with groups defined by different class–context co-occurrence patterns. (b) Attention maps show image responses under the prompt “a photo of an albatross,” comparing CLIP with our counterfactual embedding $\mathcal{C}(\bm x)$. Context-only images trigger hallucinated scores on ImageNet labels, highlighting vision-side bias.
  • Figure 2: Context-induced gender bias in CLIP. (a) Zero-shot accuracy gap on COCO-GB v1 across genders and contextual objects. (b) Corresponding PMI, revealing the co-occurrence bias pattern.
  • Figure 3: Schematic of the causal graph. The dashed line denotes $X$ and $Z$ co-occurring in the same image. $R$ represents the interaction as observable features.
  • Figure 4: Overall architecture. Our method consists of two components: the upper branch computes the TDE by subtracting the predictive contribution of context; the lower branch constructs counterfactual embeddings by recombining the object embedding with diverse alternative context embeddings, simulating intervention. Aggregating the two yields a robust, decoupled prediction. The gray dashed line and the red arrows denote the blocked and preserved causal effects on the prediction, respectively. A detailed step-by-step description of our inference pipeline can be found in the algorithm in the appendix (Framework Pseudo-code).
  • Figure 5: The attention map revealed by $\mathcal{C}(x)$-based embeddings on NICO.
  • ...and 6 more figures