Towards Stable Test-Time Adaptation in Dynamic Wild World

Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, Mingkui Tan

TL;DR

This work tackles the instability of online test-time adaptation (TTA) under real-world wild test settings, including mixed domain shifts, small batch sizes, and online imbalanced label shifts. It first shows batch normalization as a key obstacle to stable adaptation and demonstrates that batch-agnostic norms (group norm and layer norm) improve stability but can still fail. The authors introduce SAR, a sharpness-aware and reliable entropy minimization method that filters out high-gradient, unreliable samples and optimizes towards flat minima, with a model recovery safeguard. Extensive experiments on ImageNet-C and related datasets show SAR outperforms state-of-the-art TTA methods in stability and efficiency across diverse wild test conditions. The findings suggest practical guidelines for deploying TTA: prefer GN/LN over BN and apply sharpness-aware, reliability-weighted entropy minimization for robust test-time updates.
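The claim about normalization layers can be illustrated with a small, standard-library sketch. It is not the authors' code: the toy `batch_norm` and `layer_norm` functions, the feature vectors, and the `EPS` constant are all assumptions chosen for illustration, and learnable scale and shift parameters are omitted. It shows that a batch-statistics norm changes a sample's output depending on which other samples share its batch (and degenerates at batch size 1), whereas a per-sample norm does not.

```python
from statistics import fmean, pvariance

EPS = 1e-5  # assumed numerical-stability constant


def batch_norm(batch):
    """Normalize each feature with statistics pooled over the batch."""
    n_feat = len(batch[0])
    means = [fmean([x[j] for x in batch]) for j in range(n_feat)]
    variances = [pvariance([x[j] for x in batch]) for j in range(n_feat)]
    return [
        [(x[j] - means[j]) / (variances[j] + EPS) ** 0.5 for j in range(n_feat)]
        for x in batch
    ]


def layer_norm(batch):
    """Normalize each sample with its own statistics (batch-agnostic)."""
    out = []
    for x in batch:
        mu, var = fmean(x), pvariance(x)
        out.append([(v - mu) / (var + EPS) ** 0.5 for v in x])
    return out


# The same test sample placed in two different batches.
sample = [1.0, 2.0, 3.0, 4.0]
batch_a = [sample, [0.0, 0.0, 0.0, 0.0]]
batch_b = [sample, [9.0, 7.0, 5.0, 3.0]]

bn_a, bn_b = batch_norm(batch_a)[0], batch_norm(batch_b)[0]
ln_a, ln_b = layer_norm(batch_a)[0], layer_norm(batch_b)[0]

# Largest change in the sample's normalized features across the two batches.
bn_shift = max(abs(u - v) for u, v in zip(bn_a, bn_b))
ln_shift = max(abs(u - v) for u, v in zip(ln_a, ln_b))

# With a single-sample batch, batch statistics erase the sample entirely.
bn_single = batch_norm([sample])[0]

print("batch-norm shift:", bn_shift)
print("layer-norm shift:", ln_shift)
print("batch-norm output at batch size 1:", bn_single)
```

In this toy setting the batch-norm output of `sample` depends on its batch mates, while the layer-norm output is identical in both batches, which is one way to read the recommendation to prefer GN/LN over BN when test batches are small or mixed.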

Abstract

Test-time adaptation (TTA) has been shown to be effective at tackling distribution shifts between training and testing data by adapting a given model on test samples. However, the online model updating of TTA may be unstable, and this is often a key obstacle preventing existing TTA methods from being deployed in the real world. Specifically, TTA may fail to improve or even harm the model performance when test data have: 1) mixed distribution shifts, 2) small batch sizes, and 3) online imbalanced label distribution shifts, which are quite common in practice. In this paper, we investigate the causes of this instability and find that the batch norm layer is a crucial factor hindering TTA stability. Conversely, TTA can perform more stably with batch-agnostic norm layers, i.e., group or layer norm. However, we observe that TTA with group and layer norms does not always succeed and still suffers many failure cases. By digging into the failure cases, we find that certain noisy test samples with large gradients may disturb the model adaptation and result in collapsed trivial solutions, i.e., assigning the same class label to all samples. To address this collapse issue, we propose a sharpness-aware and reliable entropy minimization method, called SAR, for further stabilizing TTA from two aspects: 1) remove a portion of the noisy samples with large gradients, 2) encourage model weights to go to a flat minimum so that the model is robust to the remaining noisy samples. Promising results demonstrate that SAR performs more stably than prior methods and is computationally efficient under the above wild test scenarios.
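The two aspects named in the abstract can be sketched on a toy linear softmax classifier. This is a hedged illustration, not the authors' implementation. The following are assumptions: the function names (`predict`, `mean_entropy_grad`, `sar_like_step`), the weights, the threshold `E0`, the perturbation radius `rho`, and the learning rate `lr`. The filter thresholds prediction entropy as a stand-in for "large gradients", motivated only by the entropy/gradient-norm relationship examined in Figure 2(d). The flat-minimum step uses a generic sharpness-aware update: take the gradient at the current weights, move a small distance along it, and apply the gradient from that perturbed point to the original weights.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def entropy(p):
    return -sum(q * math.log(max(q, 1e-12)) for q in p)


def predict(W, x):
    """Class probabilities of a linear model with weight rows W."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in W])


def mean_entropy_grad(W, xs):
    """Gradient of the mean prediction entropy with respect to W."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for x in xs:
        p = predict(W, x)
        h = entropy(p)
        for k, q in enumerate(p):
            # dH/dz_k = -p_k * (log p_k + H) for softmax logits z
            dz = -q * (math.log(max(q, 1e-12)) + h)
            for j, v in enumerate(x):
                grad[k][j] += dz * v / len(xs)
    return grad


def sar_like_step(W, batch, e0, rho=0.05, lr=0.1):
    """One filtered, sharpness-aware entropy-minimization step (sketch)."""
    # 1) Reliability filter: drop samples whose entropy is at or above e0.
    kept = [x for x in batch if entropy(predict(W, x)) < e0]
    if not kept:
        return W, 0
    # 2) Perturb the weights along the normalized gradient (ascent direction).
    g = mean_entropy_grad(W, kept)
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    W_adv = [[w + rho * v / norm for w, v in zip(rw, rg)] for rw, rg in zip(W, g)]
    # 3) Update the original weights with the gradient at the perturbed point.
    g_adv = mean_entropy_grad(W_adv, kept)
    W_new = [[w - lr * v for w, v in zip(rw, rg)] for rw, rg in zip(W, g_adv)]
    return W_new, len(kept)


# Hypothetical 3-class, 2-feature model and two test samples.
W0 = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
confident, ambiguous = [2.0, 0.0], [0.1, 0.1]
E0 = 0.4 * math.log(3)  # assumed threshold: a fraction of the maximum entropy

W1, n_kept = sar_like_step(W0, [confident, ambiguous], E0)
print("samples kept:", n_kept)
print("entropy before:", entropy(predict(W0, confident)))
print("entropy after: ", entropy(predict(W1, confident)))
```

Only the low-entropy sample passes the filter here, and the update lowers its entropy. The sketch simplifies in several ways: it filters once rather than re-checking after the perturbation, it updates all weights of the toy model, and it omits any safeguard against collapse.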

Paper Structure

This paper contains 28 sections, 7 equations, 11 figures, 18 tables, and 1 algorithm.

Figures (11)

  • Figure 1: An illustration of practical/wild test-time adaptation (TTA) scenarios, in which prior online TTA methods may degrade severely. The accuracy of Tent (Wang et al., 2021) is measured on ImageNet-C at severity level 5 with ResNet50-BN (15 mixed corruptions in (a) and Gaussian noise in (b-c)).
  • Figure 2: Failure case analyses (a-c) of online test-time entropy minimization (Wang et al., 2021). (a) and (b) record the model predictions during online adaptation. (c) illustrates how the gradient norm evolves with and without model collapse. (d) investigates the relationship between a sample's entropy and its gradient norm. All experiments are conducted on shuffled ImageNet-C with Gaussian noise using ResNet50 (GN), and a larger severity level denotes a more severe distribution shift.
  • Figure 3: Effect of batch size on different TTA methods under different models (different normalization layers). Experiments are conducted on ImageNet-C with Gaussian noise. We report the mean and standard deviation of 3 runs with different random seeds. 'na' denotes accuracy without adaptation. Note that except for Vit-LN, the standard deviation is too small to display in the figures.
  • Figure 4: Performance of TTA methods on different models (different norm layers) under the mixture of 15 different corruption types (ImageNet-C). We report the mean and standard deviation over 3 independent runs.
  • Figure 5: Performance of TTA methods with different models (different norm layers) under online imbalanced label distribution shifts on ImageNet-C (Gaussian noise). We report the mean and standard deviation of 3 runs. Note that except for VitBase-LN, the standard deviation is too small to display in the figures.
  • ...and 6 more figures