Evaluation of Test-Time Adaptation Under Computational Time Constraints
Motasem Alfarra, Hani Itani, Alejandro Pardo, Shyma Alhuwaider, Merey Ramazanova, Juan C. Pérez, Zhipeng Cai, Matthias Müller, Bernard Ghanem
TL;DR
This work addresses the gap between rapid advances in Test Time Adaptation (TTA) and their real-world applicability by introducing a realistic online evaluation that ties adaptation opportunities to the data stream speed. It formalizes a relative adaptation speed metric $\mathcal{C}(g)$ and an online protocol in which slower methods gain fewer adaptation opportunities, while fast methods adapt more frequently. Through extensive experiments across ImageNet-C, ImageNet-3DCC, and CIFAR10-C, the authors show that accounting for inference speed reshapes performance rankings, with simple, fast approaches (e.g., BN-based methods) often outperforming slower, more complex ones such as diffusion-based approaches. The findings stress the importance of designing TTA methods that balance accuracy with computational efficiency, guiding practical deployment in real-time systems.
Abstract
This paper proposes a novel online evaluation protocol for Test Time Adaptation (TTA) methods, which penalizes slower methods by providing them with fewer samples for adaptation. TTA methods leverage unlabeled data at test time to adapt to distribution shifts. Although many effective methods have been proposed, their impressive performance usually comes at the cost of significantly increased computation budgets. Current evaluation protocols overlook the effect of this extra computation cost, which affects the methods' real-world applicability. To address this issue, we propose a more realistic evaluation protocol for TTA methods, where data is received in an online fashion from a constant-speed data stream, thereby accounting for the method's adaptation speed. We apply our proposed protocol to benchmark several TTA methods on multiple datasets and scenarios. Extensive experiments show that, when accounting for inference speed, simple and fast approaches can outperform more sophisticated but slower methods. For example, SHOT (2020) outperforms the state-of-the-art method SAR (2023) in this setting. Our results reveal the importance of developing practical TTA methods that are both accurate and efficient.
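To make the idea of a constant-speed stream concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical parameter `relative_cost` as a stand-in for the relative adaptation speed $\mathcal{C}(g)$, interpreted here as the number of stream batches that arrive while the method processes one batch. The function name `simulate_stream` and the rule that a busy method simply misses incoming batches are illustrative assumptions. How missed batches receive predictions is left out of the sketch.

```python
def simulate_stream(num_batches: int, relative_cost: float) -> dict:
    """Simulate which batches of a constant-speed stream a method adapts on.

    relative_cost is a hypothetical stand-in for C(g): the number of stream
    batches that arrive while the method processes a single batch
    (1.0 means the method keeps pace with the stream).
    """
    adapted, skipped = [], []
    busy_until = 0.0  # stream time at which the method becomes free again

    for t in range(num_batches):  # batch t arrives at stream time t
        if t >= busy_until:
            # The method is free: it adapts on this batch and stays busy
            # for relative_cost stream steps.
            adapted.append(t)
            busy_until = t + relative_cost
        else:
            # The method is still processing an earlier batch, so it gets
            # no adaptation opportunity on this one (assumption of the sketch).
            skipped.append(t)

    return {"adapted": adapted, "skipped": skipped}


if __name__ == "__main__":
    fast = simulate_stream(num_batches=9, relative_cost=1.0)
    slow = simulate_stream(num_batches=9, relative_cost=3.0)
    print("fast method adapts on:", fast["adapted"])
    print("slow method adapts on:", slow["adapted"])
```

Under these assumptions, a method with `relative_cost=1.0` adapts on all nine batches, while a method with `relative_cost=3.0` adapts only on batches 0, 3, and 6, which illustrates how a slower method receives fewer samples for adaptation.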
