Realistic Evaluation of Test-Time Adaptation Algorithms: Unsupervised Hyperparameter Selection

Sebastian Cygert, Damian Sójka, Tomasz Trzciński, Bartłomiej Twardowski

TL;DR

This paper tackles the realism gap in evaluating Test-Time Adaptation (TTA) by studying unsupervised hyperparameter selection strategies that do not rely on test labels. It demonstrates that model selection choices can drastically alter TTA performance, with no single unsupervised metric consistently matching oracle-based selection across diverse benchmarks. The analysis reveals that supervision (even a small amount of labeled target data or access to source data) yields robust hyperparameter choices, while purely unsupervised strategies often struggle, especially under extended adaptation. The work highlights the forgetting problem and the need for rigorous benchmarking, and provides open-source code and a testbed to standardize future evaluations and improve practical reliability of TTA methods.

Abstract

Test-Time Adaptation (TTA) has recently emerged as a promising strategy for tackling the problem of machine learning model robustness under distribution shifts by adapting the model during inference without access to any labels. Because of task difficulty, hyperparameters strongly influence the effectiveness of adaptation. However, the literature has provided little exploration into optimal hyperparameter selection. In this work, we tackle this problem by evaluating existing TTA methods using surrogate-based hp-selection strategies (which do not assume access to the test labels) to obtain a more realistic evaluation of their performance. We show that some of the recent state-of-the-art methods exhibit inferior performance compared to the previous algorithms when using our more realistic evaluation setup. Further, we show that forgetting is still a problem in TTA as the only method that is robust to hp-selection resets the model to the initial state at every step. We analyze different types of unsupervised selection strategies, and while they work reasonably well in most scenarios, the only strategies that work consistently well use some kind of supervision (either by a limited number of annotated test samples or by using pretraining data). Our findings underscore the need for further research with more rigorous benchmarking by explicitly stating model selection strategies, to facilitate which we open-source our code.
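The difference between a surrogate-based selection strategy and oracle selection can be illustrated with a minimal sketch. It assumes that each hyperparameter candidate has already produced class probabilities on the unlabeled test stream, and it uses mean prediction entropy as one illustrative label-free surrogate. The function names, the candidate keys, and the toy probabilities are all hypothetical and are not taken from the paper or its code.

```python
import numpy as np

def mean_entropy(probs):
    """Average Shannon entropy of per-sample class probabilities (rows sum to 1)."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(probs * np.log(probs)).sum(axis=1).mean())

def select_by_surrogate(candidate_probs):
    """Pick the candidate with the lowest mean entropy; uses no test labels."""
    return min(candidate_probs, key=lambda k: mean_entropy(candidate_probs[k]))

def select_by_oracle(candidate_probs, labels):
    """Pick the candidate with the highest accuracy; requires test labels."""
    def accuracy(k):
        preds = np.asarray(candidate_probs[k]).argmax(axis=1)
        return float((preds == labels).mean())
    return max(candidate_probs, key=accuracy)

# Toy, made-up data: 4 test samples, 2 classes (not results from the paper).
labels = np.array([0, 1, 0, 1])
candidates = {
    # Moderately confident and correct on every sample.
    "lr=1e-3": np.array([[0.7, 0.3], [0.3, 0.7], [0.7, 0.3], [0.3, 0.7]]),
    # Collapsed onto class 0: very confident, but only half the samples are right.
    "lr=1e-1": np.array([[0.99, 0.01]] * 4),
}

surrogate_choice = select_by_surrogate(candidates)
oracle_choice = select_by_oracle(candidates, labels)
print(surrogate_choice, oracle_choice)
```

In this toy case the entropy surrogate prefers the collapsed, overconfident candidate while the oracle prefers the accurate one, which is the kind of gap between label-free and oracle selection that the abstract refers to.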

Paper Structure (18 sections, 14 figures, 7 tables)

Figures (14)

  • Figure 1: Comparison of different model selection strategies. Values denote the number of experiments (covering different adaptation methods and datasets) on which the row model selection method outperforms or matches the column method. No method consistently outperforms every other method across all setups or matches the ORACLE model selection strategy.
  • Figure 2: TTA results using different hyperparameter selection strategies aggregated over 5 datasets. There is a varying gap between Oracle and unsupervised selection strategies. Using cat strategy to guide selection works fairly well, whereas using the obj strategy performs poorly.
  • Figure 3: Average ranking of TTA methods under different model selection strategies. Under the proposed hyperparameter selection procedure, simple methods (TENT) or those that are robust to changes in hyperparameters (MEMO) score more favorably compared to the oracle selection. Methods are ordered by a decreasing performance of oracle selection.
  • Figure 4: The varying gap between surrogate-based selection strategies and the oracle selection. The MEMO method is robust to changes in hyperparameters, and therefore, it simplifies hyperparameter selection.
  • Figure 5: Accuracy gap when testing the methods on longer adaptation scenarios (results with "-L" suffix). s-acc seems to be the most stable metric. AdaContrast is severely affected in longer scenarios.
  • ...and 9 more figures
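
The pairwise comparison described in the Figure 1 caption (how often the row strategy outperforms or matches the column strategy) can be sketched as follows. This is a minimal illustration that assumes one accuracy value per selection strategy and experiment. The strategy names and the scores are toy values chosen for the example, not numbers from the paper.

```python
import numpy as np

def win_or_tie_matrix(scores):
    """Count, for each pair (i, j), the experiments where strategy i scores >= strategy j.

    scores: array of shape (n_strategies, n_experiments), higher is better.
    Returns an integer array of shape (n_strategies, n_strategies).
    """
    scores = np.asarray(scores, dtype=float)
    # Broadcast to (row strategy, column strategy, experiment) and count over experiments.
    return (scores[:, None, :] >= scores[None, :, :]).sum(axis=2)

# Hypothetical strategies and made-up accuracies for 3 experiments.
strategies = ["oracle", "strategy_a", "strategy_b"]
scores = [
    [80.0, 75.0, 90.0],  # oracle
    [78.0, 75.0, 85.0],  # strategy_a
    [79.0, 70.0, 88.0],  # strategy_b
]

wins = win_or_tie_matrix(scores)
for name, row in zip(strategies, wins):
    print(name, row.tolist())
```

Each diagonal entry equals the number of experiments, since every strategy ties with itself. A strategy that matched the oracle everywhere would have the full experiment count in the oracle column.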