On Pitfalls of Test-Time Adaptation

Hao Zhao, Yuejiang Liu, Alexandre Alahi, Tao Lin

TL;DR

This work introduces TTAB, an open-source benchmark for Test-Time Adaptation to enable rigorous, standardized evaluation across diverse distribution shifts. It demonstrates three practical pitfalls: hyperparameter sensitivity under online batch updates, strong dependence on pre-trained model quality, and limited effectiveness of existing TTA methods across certain shift families. By benchmarking ten methods over a broad set of shifts and providing two evaluation protocols, the study reveals that no method consistently solves all distribution shifts and that model selection and evaluation must consider batch dynamics. The findings motivate broader, more systematic evaluations and a re-examination of the empirical successes of TTA, with the TTAB codebase enabling ongoing, extensible research progress.

Abstract

Test-Time Adaptation (TTA) has recently emerged as a promising approach for tackling the robustness challenge under distribution shifts. However, the lack of consistent settings and systematic studies in prior literature hinders thorough assessments of existing methods. To address this issue, we present TTAB, a test-time adaptation benchmark that encompasses ten state-of-the-art algorithms, a diverse array of distribution shifts, and two evaluation protocols. Through extensive experiments, our benchmark reveals three common pitfalls in prior efforts. First, selecting appropriate hyper-parameters, especially for model selection, is exceedingly difficult due to online batch dependency. Second, the effectiveness of TTA varies greatly depending on the quality and properties of the model being adapted. Third, even under optimal algorithmic conditions, none of the existing methods are capable of addressing all common types of distribution shifts. Our findings underscore the need for future research in the field to conduct rigorous evaluations on a broader set of models and shifts, and to re-examine the assumptions behind the empirical success of TTA. Our code is available at https://github.com/lins-lab/ttab.

Paper Structure (47 sections, 13 figures, 9 tables, 1 algorithm)

Figures (13)

  • Figure 1: A generic formulation of distribution shifts, where $\mathcal{P}(a^{1:K})$ is characterized by some attributes, for instance, two image styles and one target label.
  • Figure 2: Hyperparameter sensitivity of TTA methods: adaptation performance (test accuracy) of TENT and SHOT on CIFAR10-C (Gaussian noise) across combinations of learning rate and number of adaptation steps. The results indicate that the common practice for selecting hyperparameters, e.g. fixing the number of adaptation steps to $1$ while slightly varying the learning rate, does not necessarily improve test accuracy and may even harm it. This phenomenon occurs for all corruption types.
  • Figure 3: The batch dependency issue during TTA and non-trivial model selection, evaluated for SHOT on CIFAR10-C (Gaussian noise). Similar trends appear for all corruption types. SHOT suffers a significant performance decline in the online adaptation setting, particularly when improper hyperparameters are chosen. Taking multiple adaptation steps does not resolve the batch dependency problem. Oracle model selection, although it uses reliable label information to guide adaptation at test time, ultimately leads to even more severe dependency issues.
  • Figure 4: The impact of model quality on TTA performance, in terms of OOD vs. OOD (TTA) accuracy on CIFAR10-C. We save checkpoints from the pre-training phase of ResNet-26 with standard augmentation and evaluate TTA performance on these checkpoints using oracle model selection. The OOD generalization performance of a checkpoint has a significant impact on the overall performance (i.e. accuracy averaged over all corruption types) of the TTA methods, revealing a strong correlation between model quality and the effectiveness of TTA. Furthermore, certain TTA methods, specifically SHOT, may not improve performance on OOD datasets and may even decrease it when applied to low-quality models.
  • Figure 5: The impact of the data augmentation policy on TTA performance in the target domain. We save sequences of checkpoints from the pre-training phase of ResNet-26 under five data augmentation policies and fine-tune each sequence to study the impact of data augmentation. TENT and SHOT use episodic adaptation with oracle model selection. Different data augmentation strategies yield different corruption robustness, which leads to varying generalization performance on CIFAR10-C. However, practices in data augmentation and architecture design that are good for out-of-distribution generalization can be bad for test-time adaptation.
  • ...and 8 more figures
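
The online and episodic adaptation settings mentioned in the captions above, together with the grid over learning rate and number of adaptation steps from Figure 2, can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the TTAB code or its API: all function names are hypothetical, the "model" is reduced to fixed logits plus a single adaptable bias vector, the objective is plain entropy minimization in the spirit of TENT, and "episodic" is assumed to mean that the adapted parameters are reset to the source state before every test batch.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def entropy(p):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def entropy_grad_wrt_logits(p):
    """Gradient of the softmax entropy H w.r.t. the logits: dH/dz_j = -p_j (log p_j + H)."""
    h = entropy(p)
    return [-q * (math.log(q) + h) if q > 0.0 else 0.0 for q in p]


def adapt_and_predict(batches, lr, n_steps, episodic):
    """Toy test-time adaptation of a shared logit bias by entropy minimization.

    batches : list of batches, each a list of logit vectors from a frozen model
    lr      : learning rate of the gradient steps (hyperparameter)
    n_steps : number of adaptation steps per batch (hyperparameter)
    episodic: if True, reset the bias before every batch (assumed meaning of
              "episodic"); if False, carry it over across batches ("online")
    Returns the per-batch predictions and the final bias.
    """
    num_classes = len(batches[0][0])
    bias = [0.0] * num_classes
    preds = []
    for batch in batches:
        if episodic:
            bias = [0.0] * num_classes  # back to the source model state
        for _ in range(n_steps):
            grad = [0.0] * num_classes
            for z in batch:
                p = softmax([a + b for a, b in zip(z, bias)])
                g = entropy_grad_wrt_logits(p)
                grad = [acc + gi / len(batch) for acc, gi in zip(grad, g)]
            # Gradient descent on the mean prediction entropy of this batch.
            bias = [b - lr * gi for b, gi in zip(bias, grad)]
        # Assumption: predictions are made after adapting on the same batch.
        preds.append(
            [max(range(num_classes), key=lambda k: z[k] + bias[k]) for z in batch]
        )
    return preds, bias


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic logits standing in for a frozen model's outputs on shifted data.
    batches = [
        [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(8)]
        for _ in range(5)
    ]
    for lr in (0.01, 0.1, 1.0):
        for n_steps in (1, 3):
            for episodic in (False, True):
                _, bias = adapt_and_predict(batches, lr, n_steps, episodic)
                mode = "episodic" if episodic else "online"
                print(lr, n_steps, mode, [round(b, 3) for b in bias])
```

In the online mode the state reached on one batch becomes the starting point for the next, so earlier batches influence every later prediction; in the episodic mode each batch is adapted independently. The sketch shows only this structural difference and makes no claim about the accuracy trends reported in the figures.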