Evaluation of Test-Time Adaptation Under Computational Time Constraints

Motasem Alfarra; Hani Itani; Alejandro Pardo; Shyma Alhuwaider; Merey Ramazanova; Juan C. Pérez; Zhipeng Cai; Matthias Müller; Bernard Ghanem

TL;DR

This work addresses the gap between rapid advances in Test Time Adaptation (TTA) and their real-world applicability by introducing a realistic online evaluation that ties adaptation opportunities to the data stream speed. It formalizes a relative adaptation speed metric $\mathcal{C}(g)$ and an online protocol in which slower methods get fewer adaptation opportunities, while fast methods adapt more frequently. Through extensive experiments across ImageNet-C, ImageNet-3DCC, and CIFAR10-C, the authors show that accounting for inference speed reshapes performance rankings, with simple, fast approaches (e.g., BN-based methods) often outperforming slower, more complex ones such as diffusion-based approaches. The findings stress the importance of designing TTA methods that balance accuracy with computational efficiency, guiding practical deployment in real-time systems.

Abstract

This paper proposes a novel online evaluation protocol for Test Time Adaptation (TTA) methods, which penalizes slower methods by providing them with fewer samples for adaptation. TTA methods leverage unlabeled data at test time to adapt to distribution shifts. Although many effective methods have been proposed, their impressive performance usually comes at the cost of significantly increased computation budgets. Current evaluation protocols overlook the effect of this extra computation cost, which limits how well their results reflect real-world applicability. To address this issue, we propose a more realistic evaluation protocol for TTA methods, where data is received in an online fashion from a constant-speed data stream, thereby accounting for the method's adaptation speed. We apply our proposed protocol to benchmark several TTA methods on multiple datasets and scenarios. Extensive experiments show that, when accounting for inference speed, simple and fast approaches can outperform more sophisticated but slower methods. For example, SHOT, from 2020, outperforms the state-of-the-art method SAR, from 2023, in this setting. Our results reveal the importance of developing practical TTA methods that are both accurate and efficient.
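
The protocol described in the abstract can be illustrated with a small scheduling sketch. It is not the authors' implementation. It assumes that the relative adaptation speed $\mathcal{C}(g)$ is an integer number of stream batches revealed during one adaptation step, so that a method adapts to every $\mathcal{C}(g)$-th batch and the batches in between are predicted by the most recently adapted model. The names `simulate_stream` and `adapted_fraction` are hypothetical.

```python
from typing import List


def simulate_stream(num_batches: int, relative_speed: int) -> List[str]:
    """Label each batch of a constant-speed stream as 'adapt' or 'predict_only'.

    relative_speed plays the role of C(g) (assumption: an integer >= 1 giving
    how many batches the stream reveals while the method adapts to one batch).
    """
    if relative_speed < 1:
        raise ValueError("relative_speed must be >= 1")
    schedule = []
    busy_until = 0  # index of the first batch at which the method is free again
    for t in range(num_batches):
        if t >= busy_until:
            # The method is idle: it adapts to (and predicts) this batch.
            schedule.append("adapt")
            busy_until = t + relative_speed
        else:
            # The method is still adapting: the most recently adapted model
            # predicts this batch without any further adaptation.
            schedule.append("predict_only")
    return schedule


def adapted_fraction(schedule: List[str]) -> float:
    """Fraction of the stream that the method actually adapted to."""
    return schedule.count("adapt") / len(schedule)


if __name__ == "__main__":
    for c in (1, 3):
        plan = simulate_stream(num_batches=6, relative_speed=c)
        print(c, plan, adapted_fraction(plan))
```

Under these assumptions, a method as fast as the stream (`relative_speed=1`) adapts to every batch, while one that is three times slower adapts to only one batch in three, which is the penalty the protocol is meant to capture.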

Paper Structure (28 sections, 8 figures, 11 tables)

This paper contains 28 sections, 8 figures, 11 tables.

Introduction
Related Work
Test Time Adaptation.
Methodology
Current Protocol
Realistic Online Evaluation Protocol
Online computation of $\mathcal{C}(g)$.
Experiments
Episodic Evaluation of TTA
Continual Evaluation of TTA
Stream Speed Analysis
Results on Other Benchmarks and Architectures
Evaluation under Practical TTA
Effect of Hyper-parameter Tuning
Conclusions
...and 13 more sections

Figures (8)

Figure 1: The trend of average error rate using offline evaluation vs our proposed online evaluation. In the offline setup, TTA methods demonstrate progress across time with a decreasing average error rate, e.g. from $68.5\%$ using AdaBN to $56.2\%$ using SAR. We propose a realistic evaluation protocol that accounts for the adaptation speed of TTA methods. Under this protocol, fast methods (e.g. AdaBN) are unaffected, while slower (but more recent and sophisticated) methods (e.g. SAR) are penalized.
Figure 2: Inference under the current and realistic evaluation protocols. The current evaluation setting (left) assumes that the incoming batches of stream $\mathcal{S}$ can wait until the adaptation process of a TTA method $g$ finishes. This assumption is untenable in a real-time deployment scenario. Our proposed realistic evaluation (right) simulates a more realistic scenario where $\mathcal{S}$ reveals data at a constant speed. In this setup, slower TTA methods will adapt to a smaller portion of the stream. The remaining part of the stream will be predicted without adaptation by employing the most recent adapted model. We refer to the most recent adapted model as $f_{\theta_{t+1}}$, with $t$ denoting the time when the last sample was adapted to by $g$. When $g$ is still adapting to a sample, the incoming sample is fed to $f_{\theta_{t+1}}$ to produce predictions.
Figure 3: Continual Error Rate on ImageNet-C. We report the continual error rate of several TTA methods on the ImageNet-C benchmark under both the realistic and current setups. A lower error rate indicates a better TTA method. Continual evaluation means the corruptions are presented in sequence without resetting the model in between. We use the order shown along the x-axis, starting with brightness and ending with the clean validation set. In the current setup, we observe an increasing trend for SHOT, TENT, and TTAC-NQ. This is hypothesized to be due to overfitting on the early distribution shifts. This behavior is mitigated in the realistic setup due to adapting to fewer batches. EATA and SAR perform equally well in both realistic and current continual setups due to sample rejection. We report the standard deviation across 3 seeds.
Figure 4: Average Error Rate on ImageNet-C Under Slower Stream Speeds. We report the average error rate for several TTA methods on ImageNet-C under slower stream speeds. In our proposed realistic model evaluation, the stream speed $r$ is normalized by the time needed for a forward pass using the base model. We evaluate different TTA methods under a stream with speed $\eta r$ with $\eta \in (0, 1]$. An $\eta=1/16$ means the stream is $16$ times slower than the forward pass of the base model. We report the standard deviation across 3 different random seeds. Different TTA methods degrade differently when varying $\eta$.
Figure 5: $\mathcal{C}(g)$ computation across iterations. We report our online calculations for the relative adaptation speed of $g$, $\mathcal{C}(g)$, for SAR, SHOT, EATA, and TENT throughout a full evaluation episode. We observe that, overall, $\mathcal{C}(g)$ has a stable behavior throughout evaluation iterations.
...and 3 more figures
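
The Figure 4 caption states that the stream speed $r$ is normalized by the forward-pass time of the base model and that a stream of speed $\eta r$ with $\eta = 1/16$ is $16$ times slower. The sketch below shows one way this could translate into a relative adaptation speed. The formula is an assumption, not taken from the text above: it takes $\mathcal{C}(g)$ to be the ceiling of the ratio between the time of one adaptation step and the time between consecutive stream batches, with a minimum of 1. The function name and its parameters are hypothetical.

```python
import math


def relative_adaptation_speed(method_time: float,
                              base_forward_time: float,
                              eta: float = 1.0) -> int:
    """Assumed form of C(g) for a stream running at speed eta * r.

    method_time:       seconds the TTA method needs for one batch (hypothetical input)
    base_forward_time: seconds for one forward pass of the base model
    eta:               stream slow-down factor in (0, 1]; eta = 1/16 means the
                       stream is 16 times slower than the base forward pass
    """
    if method_time <= 0 or base_forward_time <= 0:
        raise ValueError("timings must be positive")
    if not 0 < eta <= 1:
        raise ValueError("eta must lie in (0, 1]")
    # With a slower stream, batches arrive every base_forward_time / eta
    # seconds, so fewer batches pass while the method is busy adapting.
    return max(1, math.ceil(eta * method_time / base_forward_time))


if __name__ == "__main__":
    # A method 32x slower than the base forward pass, at two stream speeds.
    print(relative_adaptation_speed(32.0, 1.0))            # full-speed stream
    print(relative_adaptation_speed(32.0, 1.0, eta=1/16))  # 16x slower stream
```

Under this assumed form, slowing the stream shrinks $\mathcal{C}(g)$ for slow methods, so they adapt to a larger share of the data, which is consistent with the caption's remark that different methods degrade differently as $\eta$ varies.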
