RDumb: A simple approach that questions our progress in continual test-time adaptation

Ori Press; Steffen Schneider; Matthias Kümmerer; Matthias Bethge

TL;DR

The Continuously Changing Corruptions (CCC) benchmark is proposed to measure the asymptotic performance of test-time adaptation (TTA) techniques. It shows that previous TTA approaches are neither effective at regularizing adaptation to avoid collapse nor able to outperform a simplistic resetting strategy.

Abstract

Test-Time Adaptation (TTA) allows pre-trained models to be updated to changing data distributions at deployment time. While early work tested these algorithms on individual fixed distribution shifts, recent work has proposed and applied methods for continual adaptation over long timescales. To examine the reported progress in the field, we propose the Continually Changing Corruptions (CCC) benchmark to measure the asymptotic performance of TTA techniques. We find that all but one of the state-of-the-art methods eventually collapse and perform worse than a non-adapting model, including models specifically proposed to be robust to performance collapse. In addition, we introduce a simple baseline, "RDumb", that periodically resets the model to its pre-trained state. RDumb performs better than or on par with the previously proposed state-of-the-art in all considered benchmarks. Our results show that previous TTA approaches are neither effective at regularizing adaptation to avoid collapse nor able to outperform a simplistic resetting strategy.
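
The resetting idea in the abstract can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dict-based model state, the `reset_every` parameter, and the toy update rule are all hypothetical, and no reset interval from the paper is implied.

```python
import copy
from typing import Callable, Iterable, List


def run_with_periodic_reset(
    pretrained_state: dict,
    batches: Iterable,
    adapt_step: Callable[[dict, object], object],
    reset_every: int,
) -> List[object]:
    """Adapt on a stream, restoring the pretrained state every `reset_every` batches."""
    if reset_every <= 0:
        raise ValueError("reset_every must be positive")
    # Work on a copy so the pretrained state is never modified by adaptation.
    state = copy.deepcopy(pretrained_state)
    outputs = []
    for step, batch in enumerate(batches):
        # Periodic reset: discard everything learned since the last reset.
        if step > 0 and step % reset_every == 0:
            state = copy.deepcopy(pretrained_state)
        # `adapt_step` (hypothetical) updates `state` in place and returns a prediction.
        outputs.append(adapt_step(state, batch))
    return outputs


def toy_adapt_step(state: dict, batch: float) -> float:
    """Toy stand-in for a TTA update: move a running bias halfway toward the batch value."""
    state["bias"] += 0.5 * (batch - state["bias"])
    return state["bias"]
```

For example, `run_with_periodic_reset({"bias": 0.0}, [1.0, 1.0, 1.0, 1.0], toy_adapt_step, reset_every=2)` adapts for two batches, resets, and adapts for two more. In a real setting the state would be the model's weights (and any optimizer state), and `adapt_step` would be whichever TTA update is being wrapped.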

Paper Structure (26 sections, 2 equations, 11 figures, 9 tables, 2 algorithms)

Introduction
CCC: Towards Infinite Testing with Continuously Changing Corruptions
Continuously changing image corruptions
Calibration to desired baseline accuracy
Generating Benchmark Runs
RDumb: Turning your model off and on again
Experiment Setup
Results
Analysis and Ablations
Discussion and Related Work
Conclusion
2D Example Experiments and Analysis
Data.
Model.
Model training.
...and 11 more sections

Figures (11)

Figure 1: Continuously Changing Corruptions show limitations of existing TTA methods. (a) Comparison between ImageNet-Val, CIN-C and CCC. The proposed version of CCC is 10$\times$ longer than CIN-C and could naturally be extended even further without repeating images. CCC consists of sequences of smooth transitions from one ImageNet-C noise to another one. For each such pair, we construct a trajectory continuously interpolating from one pure noise to the other pure noise such that baseline accuracy is kept constant. For each point along the trajectory, we sample a batch of 1k, 2k, or 5k images from ImageNet-Val, randomly crop and flip it and apply the noise combination. (b) Due to its short length and high variability in difficulty, CIN-C (top) is unable to reveal the collapse of methods such as ETA and CoTTA, while CCC (middle and bottom) can.
Figure 2: (a) Each corruption of CCC consists of applying two ImageNet-C corruptions at different severities. We extend the individual severities to be more fine-grained than in ImageNet-C, allowing for smoother noise changes and exponentially more (noise, severity) combinations. The corners are enlarged for easier viewing; zoom in for greater detail. (b) Sample dataset sequences with a constant baseline accuracy. The sequences start from the left, where Motion Blur is zeroed out, and end at the top, where Gaussian noise is zeroed out. The colors red, orange, and yellow correspond to trajectories in CCC-Easy, CCC-Medium and CCC-Hard, respectively.
Figure 3: Adaptation performance of all evaluated models depending on the number of observed samples so far. (a) CIN-C. Model performances are averaged over the 10 runs of the benchmark. (b) CCC. Model performances are averaged over the 27 runs of the three difficulty levels. See the appendix for separate plots for CCC-Easy, CCC-Medium and CCC-Hard.
Figure 4: TTA using a ViT backbone: (a) On CIN-C, EATA is better than the pretrained baseline (44.4% vs. 40.1% accuracy). (b) On CCC-Medium, EATA is worse than the pretrained baseline (38.5% vs. 42.0% accuracy). RDumb (ours) is consistently better than both EATA and the baseline.
Figure 5: (a) ETA's normalized accuracy over time, on the ImageNet-C holdout noises and each of their severities. For every noise in the holdout set, ETA reaches its maximum accuracy very quickly. (b) RDumb shares ETA's property of fast adaptation, while regularization in EATA slows adaptation.
...and 6 more figures
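
The Figure 1 and Figure 2 captions describe trajectories through pairs of corruption severities along which baseline accuracy stays constant. One way to picture this is a walk over a severity grid that always steps to the neighbouring cell whose accuracy is closest to a target. The sketch below is a hypothetical greedy illustration, not the procedure used to build CCC: the function name, the grid layout, and the synthetic accuracies in the usage note are assumptions.

```python
from typing import List, Tuple


def constant_accuracy_path(acc: List[List[float]], target: float) -> List[Tuple[int, int]]:
    """Greedy walk over a severity grid that stays close to a target accuracy.

    acc[i][j] is an assumed, precomputed baseline accuracy when the first
    corruption has severity index i and the second has severity index j.
    The walk starts with the second corruption at severity 0 and stops once
    the first corruption reaches severity 0.
    """
    n_rows, n_cols = len(acc), len(acc[0])
    # Start on the j == 0 edge, at the row whose accuracy is closest to the target.
    i = min(range(n_rows), key=lambda r: abs(acc[r][0] - target))
    j = 0
    path = [(i, j)]
    while i > 0:
        # Allowed moves: lower the first severity, raise the second, or both.
        candidates = [(i - 1, j)]
        if j + 1 < n_cols:
            candidates += [(i, j + 1), (i - 1, j + 1)]
        # Step to the candidate whose accuracy deviates least from the target.
        i, j = min(candidates, key=lambda c: abs(acc[c[0]][c[1]] - target))
        path.append((i, j))
    return path
```

With a synthetic grid such as `acc = [[100 - 10 * (i + j) for j in range(5)] for i in range(5)]` and `target=70`, the walk follows the diagonal on which accuracy equals 70. Each move either lowers the first severity or raises the second, so the loop always terminates.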
