Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang
TL;DR
This work tackles object-context shortcut biases that undermine zero-shot reliability in vision-language models by recasting the problem as causal inference. It proposes a lightweight, inference-only framework that operates in CLIP's representation space to estimate counterfactual object-context embeddings and compute a Total Direct Effect (TDE) to suppress background hallucinations while preserving beneficial interactions. By constructing counterfactuals from external scenes, batch neighbors, and text-derived descriptions, and by blending base and counterfactual TDEs, the method achieves state-of-the-art zero-shot performance on several context-sensitive benchmarks with minimal overhead and no retraining. The approach offers a practical, scalable causal pathway to debiased and reliable multimodal reasoning in open-world settings.
Abstract
Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
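The abstract gives no explicit formulas, so the following is only a minimal sketch of the idea it describes: recombine an object feature with alternative contexts, average the resulting class scores, and subtract the background-only activation. Everything specific here is an assumption made for illustration and not the authors' implementation: the function name counterfactual_tde_scores, additive recombination of object and context embeddings, the weight alpha, the logit scale of 100, and the toy 3-dimensional embeddings that stand in for CLIP features.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, as is standard for CLIP-style embeddings."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def counterfactual_tde_scores(obj_emb, bg_emb, alt_contexts, text_embs,
                              alpha=1.0, scale=100.0):
    """Hypothetical debiased class scores.

    obj_emb:      (d,)   estimated object embedding
    bg_emb:       (d,)   estimated background embedding of the test image
    alt_contexts: (k, d) alternative context embeddings (e.g. other scenes)
    text_embs:    (c, d) class text embeddings
    alpha, scale: assumed hyperparameters, not taken from the paper
    """
    text_embs = l2_normalize(text_embs)
    # Counterfactuals: the same object placed in k different contexts
    # (additive recombination is an assumption of this sketch).
    cf = l2_normalize(obj_emb[None, :] + alt_contexts)        # (k, d)
    cf_logits = (scale * cf @ text_embs.T).mean(axis=0)       # (c,)
    # Background-only activation: what the context alone would predict.
    bg_logits = scale * l2_normalize(bg_emb) @ text_embs.T    # (c,)
    # Subtract the background-only score from the counterfactual average.
    return cf_logits - alpha * bg_logits

# Toy example: class 0 = "cow", class 1 = "camel".
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
obj_emb = np.array([0.6, 0.0, 0.0])   # cow-like object feature
bg_emb = np.array([0.0, 0.8, 0.0])    # desert background, aligned with "camel"
alt_contexts = np.array([[0.0, 0.0, 0.8],
                         [0.0, 0.0, 0.5]])  # neutral alternative scenes

# Plain zero-shot scores on the factual (object + background) embedding.
naive = 100.0 * l2_normalize(obj_emb + bg_emb) @ text_embs.T
debiased = counterfactual_tde_scores(obj_emb, bg_emb, alt_contexts, text_embs)

print("naive prediction:", int(np.argmax(naive)))
print("debiased prediction:", int(np.argmax(debiased)))
```

In this toy setup the plain zero-shot score follows the background and picks the "camel" class, while the adjusted score picks the "cow" class. How the object and background expectations are actually estimated, and how the base and counterfactual effects are blended, is not specified in the abstract and is left out of the sketch.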
