Entropy is not Enough for Test-Time Adaptation: From the Perspective of Disentangled Factors

Jonghyun Lee; Dahuin Jung; Saehyung Lee; Junsung Park; Juhyeon Shin; Uiwon Hwang; Sungroh Yoon

TL;DR

The paper shows that entropy alone is an unreliable confidence metric for test-time adaptation when latent disentangled factors of the data correlate differently with labels. It introduces PLPD, a confidence metric that measures how much a prediction changes when the object's shape is destroyed, and DeYO, a TTA method that uses entropy and PLPD jointly for sample selection and sample weighting, prioritizing CPR (commonly positively-correlated with label) factors. Across mild and wild distribution shifts on ImageNet-C, Waterbirds, ColoredMNIST, ImageNet-R, and VisDA-2021, DeYO consistently outperforms state-of-the-art baselines, with notable gains in hard scenarios, even exceeding random-chance accuracy on ColoredMNIST. The approach adds modest computational overhead and offers insight into using factor-aware signals for online adaptation.

Abstract

Test-time adaptation (TTA) fine-tunes pre-trained deep neural networks for unseen test data. The primary challenge of TTA is limited access to the entire test dataset during online updates, which causes error accumulation. To mitigate this, TTA methods have used the entropy of the model output as a confidence metric to determine which samples are less likely to cause errors. Through experimental studies, however, we observed that entropy is an unreliable confidence metric for TTA under biased scenarios, and we theoretically revealed that this unreliability stems from neglecting the influence of latent disentangled factors of data on predictions. Building upon these findings, we introduce a novel TTA method named Destroy Your Object (DeYO), which leverages a newly proposed confidence metric named Pseudo-Label Probability Difference (PLPD). PLPD quantifies the influence of the shape of an object on prediction by measuring the difference between predictions before and after applying an object-destructive transformation. DeYO consists of sample selection and sample weighting, both of which employ entropy and PLPD concurrently. For robust adaptation, DeYO prioritizes samples whose predictions rely dominantly on shape information. Our extensive experiments demonstrate the consistent superiority of DeYO over baseline methods across various scenarios, including biased and wild ones. The project page is publicly available at https://whitesnowdrop.github.io/DeYO/.
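The PLPD idea described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. Several pieces are assumptions made only for illustration: patch permutation is used as the object-destructive transformation, `shape_model` and `texture_model` are hypothetical toy classifiers, the two thresholds are arbitrary values, and the weighting formula is simply one function that decreases with entropy and increases with PLPD. The point of the toy setup is that both models produce exactly the same low entropy on the image, yet only the shape-reliant model shows a large PLPD.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def permute_patches(image, patches_per_side, order):
    """Rearrange the patches of a square image; order[dst] is the source patch index."""
    size = len(image)
    step = size // patches_per_side
    out = [[0.0] * size for _ in range(size)]
    for dst, src in enumerate(order):
        dr, dc = divmod(dst, patches_per_side)
        sr, sc = divmod(src, patches_per_side)
        for i in range(step):
            for j in range(step):
                out[dr * step + i][dc * step + j] = image[sr * step + i][sc * step + j]
    return out


def patch_shuffle(image, patches_per_side, rng):
    """Assumed object-destructive transformation: a random patch permutation."""
    order = list(range(patches_per_side ** 2))
    rng.shuffle(order)
    return permute_patches(image, patches_per_side, order)


def plpd(model, image, transform):
    """Pseudo-label probability before minus after the transformation."""
    probs = softmax(model(image))
    pseudo_label = max(range(len(probs)), key=probs.__getitem__)
    probs_destroyed = softmax(model(transform(image)))
    return probs[pseudo_label] - probs_destroyed[pseudo_label]


def select_and_weight(model, image, transform, ent_threshold, plpd_threshold):
    """Keep a sample only if entropy is low AND PLPD is high (thresholds are hypothetical)."""
    ent = entropy(softmax(model(image)))
    diff = plpd(model, image, transform)
    selected = ent < ent_threshold and diff > plpd_threshold
    # Illustrative weight: larger for lower entropy and for higher PLPD.
    weight = math.exp(-ent) + math.exp(diff) if selected else 0.0
    return selected, weight


# Toy 4x4 "image" with a bright object in the top-left quadrant.
IMAGE = [[1.0 if r < 2 and c < 2 else 0.0 for c in range(4)] for r in range(4)]


def shape_model(img):
    """Hypothetical classifier that depends on WHERE the bright pixels are."""
    top_left = sum(img[r][c] for r in range(2) for c in range(2))
    bottom_right = sum(img[r][c] for r in range(2, 4) for c in range(2, 4))
    return [top_left, bottom_right]


def texture_model(img):
    """Hypothetical classifier that depends only on total brightness."""
    return [sum(sum(row) for row in img), 0.0]


def swap_corners(img):
    """Deterministic patch permutation so the demo is reproducible."""
    return permute_patches(img, 2, [3, 2, 1, 0])


for name, model in [("shape", shape_model), ("texture", texture_model)]:
    print(name, select_and_weight(model, IMAGE, swap_corners, 0.5, 0.2))
```

In this sketch an entropy-only filter would accept both samples, because their output distributions are identical. Adding the PLPD condition rejects the prediction that is unchanged after the spatial layout is destroyed, which is the behavior the abstract attributes to DeYO's sample selection.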

Paper Structure (36 sections, 1 theorem, 16 equations, 10 figures, 20 tables, 1 algorithm)

This paper contains 36 sections, 1 theorem, 16 equations, 10 figures, 20 tables, 1 algorithm.

Introduction
Revisiting TTA: from the perspective of disentangled factors
Motivating observations
Preliminaries
Entropy is not enough
Methodology
Sample selection
Sample weighting
Overall procedure of DeYO
Experiments
Main Results
Role and Effect of PLPD
Hyperparameter and Ablation Studies on DeYO
Conclusion
Proof of Proposition 1
...and 21 more sections

Key Result

Proposition 1

Let us consider a pre-trained linear classifier $\mathcal{M}_{\bm{\theta}}$ that uses the latent disentangled factors ${\mathbf{v}}({\mathbf{x}})$ of sample ${\mathbf{x}}$ as input. We define a harmful sample as one that reduces the difference in the mean logits between classes when used for adaptation ... where $\mathcal{X}^{\mathrm{test}}_{y}=\{{\mathbf{x}} | ({\mathbf{x}},{\textnormal{y}})\in\mathcal{ ...

Figures (10)

Figure 1: The accuracy within the worst group of the Waterbirds benchmark.
Figure 2: (a) A graph that represents accuracy by entropy levels. The lowest entropy interval 0$\sim$Q1 exhibits the lowest accuracy. (b) and (c) display Grad-CAM visualization of samples with correct and incorrect predictions with extremely low entropy, respectively.
Figure 3: Examples of a transformed image ${\mathbf{x}}'$ created by different object-destructive transformation methods. ${\mathbf{x}}$ is an example from Waterbirds.
Figure 4: The overview of DeYO. DeYO comprises sample selection and sample weighting mechanisms (see the Sample selection and Sample weighting sections). The areas within the green box are distinguished based on entropy and PLPD intervals, with Area 4 corresponding to $S_{\bm\theta}({\mathbf{x}})$ in the Sample selection section.
Figure 5: The Risk-Coverage curve of the worst group on Waterbirds.
...and 5 more figures

Theorems & Definitions (2)

Proposition 1
proof
