
Test-Time Adaptation to Distribution Shift by Confidence Maximization and Input Transformation

Chaithanya Kumar Mummadi, Robin Hutmacher, Kilian Rambach, Evgeny Levinkov, Thomas Brox, Jan Hendrik Metzen

TL;DR

The paper tackles test-time adaptation under distribution shift using only unlabeled target data. It introduces two non-saturating loss functions based on likelihood ratios, the hard likelihood ratio (HLR) and the soft likelihood ratio (SLR), together with a moving-average diversity regularizer that prevents trivial collapse, and combines them with a learnable input transformation module prepended to the pretrained network. The approach adapts only a subset of parameters and the normalization statistics, and it is evaluated on ImageNet-C and ImageNet-R across several architectures. Empirically, it outperforms entropy-minimization baselines, improves corruption robustness, and shows that adaptation on a small amount of data can generalize to the rest of the target distribution. Overall, the method provides a practical, source-free framework for improving the performance of pretrained classifiers under real-world distribution shifts.

Abstract

Deep neural networks often exhibit poor performance on data that is unlikely under the train-time data distribution, for instance data affected by corruptions. Previous works demonstrate that test-time adaptation to data shift, for instance using entropy minimization, effectively improves performance on such shifted distributions. This paper focuses on the fully test-time adaptation setting, where only unlabeled data from the target distribution is required. This allows adapting arbitrary pretrained networks. Specifically, we propose a novel loss that improves test-time adaptation by addressing both premature convergence and instability of entropy minimization. This is achieved by replacing the entropy by a non-saturating surrogate and adding a diversity regularizer based on batch-wise entropy maximization that prevents convergence to trivial collapsed solutions. Moreover, we propose to prepend an input transformation module to the network that can partially undo test-time distribution shifts. Surprisingly, this preprocessing can be learned solely using the fully test-time adaptation loss in an end-to-end fashion without any target domain labels or source domain data. We show that our approach outperforms previous work in improving the robustness of publicly available pretrained image classifiers to common corruptions on such challenging benchmarks as ImageNet-C.
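The diversity regularizer described in the abstract can be illustrated with a small NumPy sketch. The source does not give the formula, so this is a hedged reading: it assumes the regularizer is the negative entropy of a moving-average estimate of the predicted class marginal. The function name `diversity_loss`, the momentum parameter `kappa`, and its default value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def diversity_loss(probs, running_mean=None, kappa=0.9):
    """Negative entropy of a moving-average estimate of the class marginal.

    probs: (batch, classes) softmax outputs for one batch.
    running_mean: previous marginal estimate, or None on the first batch.
    kappa: momentum of the moving average (assumed value, not from the source).
    """
    batch_mean = probs.mean(axis=0)
    if running_mean is None:
        estimate = batch_mean
    else:
        estimate = kappa * running_mean + (1.0 - kappa) * batch_mean
    # Minimizing sum(p * log p) maximizes the entropy of the estimated marginal.
    loss = float(np.sum(estimate * np.log(estimate + 1e-12)))
    return loss, estimate

# Every sample confidently predicts class 0 (a collapsed solution).
collapsed = np.tile([0.97, 0.01, 0.01, 0.01], (8, 1))
# Equally confident predictions, but spread evenly over the four classes.
diverse = np.eye(4)[np.arange(8) % 4] * 0.96 + 0.01

loss_collapsed, _ = diversity_loss(collapsed)
loss_diverse, marginal = diversity_loss(diverse)
print(loss_collapsed, loss_diverse)
```

Under these assumptions the collapsed batch receives a higher loss than the diverse one even though the per-sample confidence is identical, which is the behavior needed to counteract collapse during confidence maximization.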

Paper Structure

This paper contains 16 sections, 3 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Illustration of different losses for confidence maximization. Losses (left, shifted such that the maxima of all losses are at 0) and the resulting gradients with respect to the first logit (right) are shown as a function of the first class's confidence for a binary classification problem. Both entropy and hard pseudo-labels have vanishing gradients for high-confidence predictions. Accordingly, both have maximum gradient amplitude for low-confidence self-supervision, with this effect being stronger for the hard pseudo-labels. Hard Likelihood Ratio has constant gradient amplitude for any confidence and thus takes low- and high-confidence self-supervision into account equally. Soft Likelihood Ratio also shows non-vanishing (albeit non-maximum) gradients for high-confidence self-supervision and additionally produces small gradient amplitudes from low-confidence self-supervision. Since the likelihood ratio-based losses are unbounded, the design of the model needs to ensure that logits cannot grow without bound.
  • Figure 2: Test-time adaptation results on (top row) ImageNet-C, averaged across all 15 corruptions and severities, (middle row) ImageNet-R, (bottom row) clean ImageNet. NA refers to "No Adaptation".
  • Figure 3: Test-time adaptation of ResNet50 using (top row) a subset of classes, and (bottom row) a subset of samples per class on 4 different corruptions at severity 5. Accuracy is computed by evaluating the adapted model on the entire target data. Note that the error bars are too small to be visible.
  • Figure A1: Structure of our adaptable model $g$, which comprises $r_\psi$.
  • Figure A2: Effect of different $\kappa$ on both (a) HLR and (b) SLR.
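The gradient behavior described in the Figure 1 caption can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the listing does not spell out the loss formulas, so HLR is assumed here to be the negative log ratio of the top-class probability to the summed probability of all other classes. The helper names (`entropy_loss`, `hlr_loss`, `grad_first_logit`) are hypothetical, and gradients are approximated by central finite differences rather than automatic differentiation.

```python
import numpy as np

def logsumexp(x, axis=-1):
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def softmax(logits):
    return np.exp(logits - logsumexp(logits)[..., None])

def entropy_loss(logits):
    # Shannon entropy of the softmax prediction, one value per sample.
    log_p = logits - logsumexp(logits)[..., None]
    return -(np.exp(log_p) * log_p).sum(axis=-1)

def hlr_loss(logits):
    # Assumed form: -log(p_top / sum_{i != top} p_i)
    #             = -(o_top - logsumexp(o_others)).
    idx = np.arange(logits.shape[0])
    top = np.argmax(logits, axis=-1)
    others = logits.copy()
    others[idx, top] = -np.inf  # exclude the top class from the denominator
    return -(logits[idx, top] - logsumexp(others))

def grad_first_logit(loss_fn, logits, eps=1e-5):
    # Central finite difference of the loss with respect to the first logit.
    up, down = logits.copy(), logits.copy()
    up[:, 0] += eps
    down[:, 0] -= eps
    return (loss_fn(up) - loss_fn(down)) / (2 * eps)

# Binary problem: one low-confidence and one high-confidence prediction.
for pair in ([0.5, 0.0], [10.0, 0.0]):
    x = np.array([pair])
    print(softmax(x)[0, 0],
          grad_first_logit(entropy_loss, x)[0],
          grad_first_logit(hlr_loss, x)[0])
```

With this assumed form, the entropy gradient shrinks toward zero as the first class's confidence approaches 1, while the HLR gradient keeps the same magnitude. The same property means the HLR value decreases without bound as the logit gap grows, which matches the caption's remark that the model design must keep logits bounded.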