In Search of Lost Online Test-time Adaptation: A Survey

Zixin Wang; Yadan Luo; Liang Zheng; Zhuoxiao Chen; Sen Wang; Zi Huang

TL;DR

This survey addresses online test-time adaptation (OTTA) by proposing a three-category taxonomy (optimization-based, data-based, and model-based) and benchmarking eight representative OTTA methods on Vision Transformers (ViT). It introduces a diverse set of real-world and corrupted testbeds, including CIFAR-10/100-C, ImageNet-C, CIFAR-10-Warehouse, and OfficeHome, and evaluates both accuracy and efficiency, the latter through GFLOPs, wall-clock time, and GPU memory usage. The key findings are that Transformers are more resilient to domain shifts, that many OTTA methods depend on large test batches for their best performance, and that stability against perturbations is crucial when the batch size is 1. The paper gives concrete guidance on deploying OTTA with ViT, highlights the importance of normalization strategies and memory banks, and suggests prompting and multimodal extensions as directions for future research. Code is available for reproducibility.

Abstract

This article presents a comprehensive survey of online test-time adaptation (OTTA), focusing on effectively adapting machine learning models to distributionally different target data upon batch arrival. Despite the recent proliferation of OTTA methods, conclusions from previous studies are inconsistent due to ambiguous settings, outdated backbones, and inconsistent hyperparameter tuning, which obscure core challenges and hinder reproducibility. To enhance clarity and enable rigorous comparison, we classify OTTA techniques into three primary categories and benchmark them using a modern backbone, the Vision Transformer (ViT). Our benchmarks cover conventional corrupted datasets such as CIFAR-10/100-C and ImageNet-C, as well as real-world shifts represented by CIFAR-10.1, OfficeHome, and CIFAR-10-Warehouse. The CIFAR-10-Warehouse dataset covers a wide range of variations, drawn from different search engines and from data synthesized with diffusion models. To measure efficiency in online scenarios, we introduce novel evaluation metrics, including GFLOPs, wall-clock time, and GPU memory usage, providing a clearer picture of the trade-offs between adaptation accuracy and computational overhead. Our findings diverge from existing literature, revealing that (1) transformers demonstrate heightened resilience to diverse domain shifts, (2) the efficacy of many OTTA methods relies on large batch sizes, and (3) stability in optimization and resistance to perturbations are crucial during adaptation, particularly when the batch size is 1. Based on these insights, we highlight promising directions for future research. Our benchmarking toolkit and source code are available at https://github.com/Jo-wang/OTTA_ViT_survey.
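As a rough illustration of the online protocol described above, in which each unlabeled test batch is seen once and predictions are made as it arrives, the following toy sketch adapts a single running normalization statistic to a shifted stream and times the run with a wall-clock timer. It is not one of the benchmarked methods and does not reflect the survey's toolkit. The data, the `run_stream` function, the `momentum` value, and the sign-based classifier are all hypothetical choices made only for this example.

```python
import time

# Hypothetical target stream: each batch holds (feature, label) pairs whose
# features are shifted by +3 relative to an assumed source domain in which
# class 0 sat near -1 and class 1 sat near +1.
BATCH = [(1.8, 0), (2.2, 0), (3.8, 1), (4.2, 1)]
STREAM = [BATCH, BATCH, BATCH]


def run_stream(stream, adapt, momentum=0.5):
    """Predict every batch once, on arrival; return (accuracy, seconds).

    When `adapt` is True, the running mean is updated from the unlabeled
    features of the incoming batch before predicting on that batch.
    Labels are used only for scoring, never for adaptation.
    """
    running_mean = 0.0  # statistic assumed to be estimated on the source domain
    correct = 0
    total = 0
    start = time.perf_counter()
    for batch in stream:
        if adapt:
            batch_mean = sum(x for x, _ in batch) / len(batch)
            # Exponential moving average of the feature mean.
            running_mean = (1 - momentum) * running_mean + momentum * batch_mean
        for x, label in batch:
            pred = 1 if x - running_mean > 0 else 0
            correct += int(pred == label)
            total += 1
    elapsed = time.perf_counter() - start
    return correct / total, elapsed


source_acc, source_time = run_stream(STREAM, adapt=False)
adapted_acc, adapted_time = run_stream(STREAM, adapt=True)
print(f"no adaptation: {source_acc:.2f}, online adaptation: {adapted_acc:.2f}")
```

On this toy stream the unadapted model is at chance, while the adapted one errs only on the first batch, before the running mean has moved far enough toward the target statistics. In a real benchmark the elapsed time would be reported alongside compute and memory measurements, which this sketch does not attempt.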
