AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation

Taeckyung Lee, Sorn Chottananurak, Taesik Gong, Sung-Ju Lee

TL;DR

Test-time adaptation (TTA) enables models to cope with domain shifts using unlabeled test data, but practical deployment is hampered by adaptation failures and the lack of ground-truth labels for monitoring performance. The authors introduce AETTA, a label-free accuracy estimator that uses prediction disagreement with dropouts (PDD), the disagreement between the adapted model's prediction and its dropout inferences, to estimate post-adaptation accuracy. They then extend PDD with a robust variant so that the estimate remains usable under adaptation failures. The approach is theoretically grounded through disagreement-equality results and is empirically validated on CIFAR10-C, CIFAR100-C, and ImageNet-C with six TTA methods, where its estimates are on average 19.8 percentage points more accurate than those of four baselines. A model recovery case study shows that the estimate can be used to monitor and recover models on unlabeled test streams. The source code is publicly available.

Abstract

Test-time adaptation (TTA) has emerged as a viable solution to adapt pre-trained models to domain shifts using unlabeled test data. However, TTA faces challenges of adaptation failures due to its reliance on blind adaptation to unknown test samples in dynamic scenarios. Traditional methods for out-of-distribution performance estimation are limited by unrealistic assumptions in the TTA context, such as requiring labeled data or re-training models. To address this issue, we propose AETTA, a label-free accuracy estimation algorithm for TTA. We propose the prediction disagreement as the accuracy estimate, calculated by comparing the target model prediction with dropout inferences. We then improve the prediction disagreement to extend the applicability of AETTA under adaptation failures. Our extensive evaluation with four baselines and six TTA methods demonstrates that AETTA shows an average of 19.8%p more accurate estimation compared with the baselines. We further demonstrate the effectiveness of accuracy estimation with a model recovery case study, showcasing the practicality of our model recovery based on accuracy estimation. The source code is available at https://github.com/taeckyung/AETTA.

Paper Structure (51 sections, 2 theorems, 20 equations, 6 figures, 13 tables, 1 algorithm)

This paper contains 51 sections, 2 theorems, 20 equations, 6 figures, 13 tables, 1 algorithm.

Key Result

Theorem 3.1

If the hypothesis space ${\mathcal{H}}_{\mathcal{A}}$ and the corresponding expectation function $\Tilde{h}$ satisfy dropout independence and confidence-prediction calibration, prediction disagreement with dropouts (PDD) approximates the test error over ${\mathcal{H}}_{\mathcal{A}}$.
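
As a rough illustration of the quantity the theorem refers to, the sketch below computes a disagreement rate between a model's deterministic prediction and several dropout-enabled predictions on an unlabeled batch, and reports one minus that rate as an accuracy estimate. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the toy two-layer model, the names `make_toy_model` and `pdd_accuracy_estimate`, the dropout rate, and the number of dropout passes are all hypothetical, and the robust calibration for adaptation failures is omitted.

```python
import numpy as np


def make_toy_model(in_dim=8, hidden=32, n_classes=5, drop_p=0.3, seed=1):
    """Build a hypothetical two-layer classifier with optional dropout."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(size=(in_dim, hidden))
    w2 = rng.normal(size=(hidden, n_classes))

    def forward(x, drop_rng=None):
        h = np.maximum(x @ w1, 0.0)  # ReLU hidden layer
        if drop_rng is not None:
            # Inverted dropout: zero units with probability drop_p, then rescale.
            mask = drop_rng.random(h.shape) >= drop_p
            h = h * mask / (1.0 - drop_p)
        return h @ w2  # class logits

    return forward


def pdd_accuracy_estimate(forward, x, n_dropout=10, seed=0):
    """Return (estimated accuracy, PDD) for an unlabeled batch x.

    PDD here is the fraction of (sample, dropout pass) pairs whose
    dropout prediction differs from the deterministic prediction.
    """
    rng = np.random.default_rng(seed)
    base = forward(x, None).argmax(axis=1)  # prediction without dropout
    disagreements = []
    for _ in range(n_dropout):
        pred = forward(x, rng).argmax(axis=1)  # one stochastic dropout pass
        disagreements.append(pred != base)
    pdd = float(np.mean(disagreements))
    return 1.0 - pdd, pdd


forward = make_toy_model()
x = np.random.default_rng(2).normal(size=(64, 8))  # stand-in for a test batch
acc_est, pdd = pdd_accuracy_estimate(forward, x)
print(f"PDD = {pdd:.3f}, estimated accuracy = {acc_est:.3f}")
```

No labels enter the computation, which is the point of the estimator: only the model's own predictions with and without dropout are compared. Whether one minus the disagreement rate tracks the true accuracy depends on the conditions the theorem names.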

Figures (6)

  • Figure 1: AETTA estimates the model's accuracy after adaptation using unlabeled test data without needing source data or ground-truth labels. AETTA can be integrated into existing TTA methods to estimate their accuracy under various scenarios.
  • Figure 2: Batch-wise accuracy, confidence, and prediction distribution when a model failed to adapt. TENT is used on CIFAR100-C with continually changing domains. The model becomes over-confident, and predictions are skewed.
  • Figure 3: Correlations between the confidence value of estimated expectation function $\Tilde{h}$ and (1) ground-truth accuracy (GroundTruth), (2) conditional probability $p(Y = k' | \Tilde{h}_{k'} (X) = q)$ of confidence-prediction calibration (CPC), and (3) robust confidence-prediction calibration (RCPC). We used six TTA methods on CIFAR100-C with continual domain changes. We observed accuracy degradation in TENT and EATA and improvement in SAR, CoTTA, RoTTA, and SoTTA. When models failed to adapt, the original CPC misaligned with the ground truth. In contrast, our RCPC dynamically scaled the probability $p$, thus showing better alignment.
  • Figure 4: Qualitative results on continual CIFAR10-C, CIFAR100-C, and ImageNet-C.
  • Figure 5: Impact of hyperparameters on the accuracy estimation performance.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Definition 3.1
  • Definition 3.2
  • Theorem 3.1: Disagreement Equality
  • Definition 3.3
  • Theorem 3.2: Robust Disagreement Equality