Fair Context Learning for Evidence-Balanced Test-Time Adaptation in Vision-Language Models

Sanggeon Yun, Ryozo Masukawa, SungHeon Jeong, Wenjun Huang, Hanning Chen, Mohsen Imani

TL;DR

This work tackles robustness gaps of vision–language models under distribution shifts by moving beyond entropy-based test-time adaptation. It introduces Fair Context Learning (FCL), a two-stage framework that first explores plausible class candidates via low-entropy augmented views and then calibrates text contexts to balance sensitivity to shared visual evidence using common-evidence maps. The calibration objective combines a Jensen–Shannon divergence term with a semantic-alignment regularizer, enabling non–entropy-based adaptation that mitigates partial feature obsession. Empirical results across natural shifts and fine-grained datasets demonstrate competitive gains with improved fairness, efficiency, and generalization, validated by extensive ablations and qualitative analyses.

Abstract

Vision-Language Models (VLMs) such as CLIP enable strong zero-shot recognition but suffer substantial degradation under distribution shifts. Test-Time Adaptation (TTA) aims to improve robustness using only unlabeled test samples, yet most prompt-based TTA methods rely on entropy minimization, an approach that can amplify spurious correlations and induce overconfident errors when classes share visual features. We propose Fair Context Learning (FCL), an episodic TTA framework that avoids entropy minimization by explicitly addressing shared-evidence bias. Motivated by our additive evidence decomposition assumption, FCL decouples adaptation into (i) augmentation-based exploration to identify plausible class candidates, and (ii) fairness-driven calibration that adapts text contexts to equalize sensitivity to common visual evidence. This fairness constraint mitigates partial feature obsession and enables effective calibration of text embeddings without relying on entropy reduction. Through extensive evaluation, we empirically validate our theoretical motivation and show that FCL achieves competitive adaptation performance relative to state-of-the-art TTA methods across diverse domain-shift and fine-grained benchmarks.

Paper Structure (68 sections, 26 equations, 5 figures, 13 tables)

Figures (5)

  • Figure 1: Conceptual overview of Fair Context Learning. Augmented views probe class-aligned evidence features. Our approach first explores candidate classes through confident view scoring and then calibrates the text embeddings in feature space using common-evidence maps, mitigating partial feature obsession.
  • Figure 2: The pipeline of the proposed test-time adaptation framework via fair context learning. Given a test image, the method first performs an exploration stage that identifies $K$ candidate classes $\mathcal{C}_K$ through augmented-view evaluation and low-entropy majority voting. Next, in the calibration stage, the model learns a fair context by estimating pairwise common feature regions among candidate classes and optimizing their score distributions toward uniformity via backpropagation to update the soft prompt $\delta$. Finally, the learned context $\delta^\star$ is used for prediction by running a final exploration pass restricted to the candidate classes $\mathcal{C}_K$.
  • Figure 3: Empirical validation of our theory. (a) Partial Feature Obsession: score distributions and evidence maps show misclassifications driven by shared features. (b) Common Evidence Contribution: ECEC is consistently lower for correct samples across ImageNet variants. (c) Entropy–Uniqueness Relation: EUEC negatively correlates with entropy, indicating stronger unique evidence at low entropy. Shaded regions show 95% CIs. Kernel density estimations are Gaussian with 400 bootstrap resamples.
  • Figure 4: Impact of the calibration stage. Top-1 accuracy gains from adding calibration on top of exploration across fine-grained datasets.
  • Figure 5: Visualization of estimated class evidence maps for randomly sampled ImageNet-A images. For each example, we show the original image (left), the ground-truth class evidence map (top right), and the evidence map of a competing incorrect class (bottom right). Brighter regions indicate areas the model relies on more heavily when making its prediction.
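
The two stages named in the Figure 2 caption can be illustrated with a small, hedged sketch. It is not the authors' implementation: the function names, the `keep_ratio` parameter, and the tie-breaking rule are assumptions made for illustration, and the real calibration stage operates on common-evidence maps and updates the soft prompt $\delta$ by backpropagation, which is omitted here. The sketch only shows (a) choosing $K$ candidate classes by majority vote over the lowest-entropy augmented views and (b) a Jensen–Shannon divergence between a score distribution and the uniform distribution, which is the kind of quantity a "toward uniformity" objective would drive down.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def explore_candidates(view_logits, keep_ratio=0.5, k=2):
    """Pick k candidate classes by majority vote over low-entropy views.

    view_logits: one list of class logits per augmented view.
    keep_ratio and the tie-breaking rule are illustrative assumptions.
    """
    probs = [softmax(logits) for logits in view_logits]
    ranked = sorted(probs, key=entropy)            # most confident views first
    n_keep = max(1, int(len(ranked) * keep_ratio))
    confident = ranked[:n_keep]
    # Each confident view votes for its top-1 class.
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in confident)
    n_classes = len(probs[0])
    mean_prob = [sum(p[c] for p in confident) / n_keep for c in range(n_classes)]
    # Rank by vote count; fall back to mean probability to break ties.
    order = sorted(range(n_classes), key=lambda c: (-votes[c], -mean_prob[c]))
    return order[:k]


def _kl(a, b):
    """KL divergence KL(a || b) for probability vectors with b > 0 where a > 0."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)


def js_to_uniform(scores):
    """Jensen-Shannon divergence between softmax(scores) and the uniform distribution."""
    p = softmax(scores)
    u = [1.0 / len(p)] * len(p)
    m = [(a + b) / 2 for a, b in zip(p, u)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(u, m)


if __name__ == "__main__":
    views = [[5.0, 1.0, 0.0], [4.5, 1.0, 0.2], [0.1, 0.2, 0.1], [1.0, 4.0, 0.0]]
    print(explore_candidates(views, keep_ratio=0.5, k=2))
    print(js_to_uniform([1.0, 1.0, 1.0]), js_to_uniform([10.0, 0.0, 0.0]))
```

In this toy setup, a peaked score vector over the candidate classes yields a larger divergence from uniform than a flat one, so minimizing such a term would push the candidates' scores on shared evidence toward balance.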