Improving robustness against common corruptions by covariate shift adaptation

Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, Matthias Bethge

TL;DR

This work shows that many image corruptions induce covariate shift primarily in first- and second-order feature moments, which can be mitigated by unsupervised adaptation of Batch Normalization statistics computed on unlabeled corrupted data. The authors propose a simple yet effective baseline that combines training-time BN statistics with test-time estimates via a pseudo sample size N, enabling full, partial, or no adaptation depending on available data. Across 25 pretrained architectures and multiple robustness benchmarks, BN adaptation yields consistent performance gains, including substantial improvements on ImageNet-C (e.g., ResNet-50 from 76.7% to 62.2% mCE) and state-of-the-art results when combined with DeepAugment+AugMix. The findings argue for incorporating adapted statistics into corruption-benchmark reporting and suggest a practical, scalable path to robust vision systems, while also exploring limits under extreme or non-moment-based shifts and the impact of large-scale pre-training.
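The combination of training-time and test-time statistics described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes the convex weights are N / (N + n) for the source statistics and n / (N + n) for the target estimate, where n is the number of unlabeled target samples. The function names `combine_bn_statistics` and `normalize` and the toy data are hypothetical.

```python
import numpy as np


def combine_bn_statistics(mu_s, var_s, mu_t, var_t, n, N):
    """Blend source (training) and target (test-batch) BN statistics.

    ASSUMPTION: the weights are N / (N + n) for the source statistics and
    n / (N + n) for the target estimate, with n the number of target
    samples and N the pseudo sample size assigned to the source.
    """
    if n + N <= 0:
        raise ValueError("n + N must be positive")
    w_source = N / (N + n)
    w_target = n / (N + n)
    mu = w_source * mu_s + w_target * mu_t
    var = w_source * var_s + w_target * var_t
    return mu, var


def normalize(x, mu, var, eps=1e-5):
    """Normalization step of a BN layer (affine scale and shift omitted)."""
    return (x - mu) / np.sqrt(var + eps)


# Toy example: 4 feature channels with "clean" source statistics.
rng = np.random.default_rng(0)
mu_s, var_s = np.zeros(4), np.ones(4)

# A hypothetical "corrupted" batch whose first two moments are shifted.
x_t = rng.normal(loc=2.0, scale=3.0, size=(64, 4))
mu_t, var_t = x_t.mean(axis=0), x_t.var(axis=0)  # biased estimates (ddof=0)
n = x_t.shape[0]

# n = 0: no target samples are used, so the source statistics are kept.
mu_none, var_none = combine_bn_statistics(mu_s, var_s, mu_t, var_t, n=0, N=16)
# n = 64, N = 16: partial adaptation, target weight 64 / 80 = 0.8.
mu_part, var_part = combine_bn_statistics(mu_s, var_s, mu_t, var_t, n=n, N=16)
# N = 0: full adaptation, only the target estimates are used.
mu_full, var_full = combine_bn_statistics(mu_s, var_s, mu_t, var_t, n=n, N=0)

z_full = normalize(x_t, mu_full, var_full)  # re-centred, re-scaled features
```

Under this weighting, N = 0 recovers full adaptation, while N much larger than n leaves the source statistics almost unchanged, which matches the "full, partial, or no adaptation" range described above.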

Abstract

Today's state-of-the-art machine vision models are vulnerable to image corruptions like blurring or compression artefacts, limiting their performance in many real-world applications. We here argue that popular benchmarks to measure model robustness against common corruptions (like ImageNet-C) underestimate model robustness in many (but not all) application scenarios. The key insight is that in many scenarios, multiple unlabeled examples of the corruptions are available and can be used for unsupervised online adaptation. Replacing the activation statistics estimated by batch normalization on the training set with the statistics of the corrupted images consistently improves the robustness across 25 different popular computer vision models. Using the corrected statistics, ResNet-50 reaches 62.2% mCE on ImageNet-C compared to 76.7% without adaptation. With the more robust DeepAugment+AugMix model, we improve the state of the art achieved by a ResNet-50 model to date from 53.6% mCE to 45.4% mCE. Even adapting to a single sample improves robustness for the ResNet-50 and AugMix models, and 32 samples are sufficient to improve the current state of the art for a ResNet-50 architecture. We argue that results with adapted statistics should be included whenever reporting scores in corruption benchmarks and other out-of-distribution generalization settings.
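For readers unfamiliar with the mCE numbers quoted above, the following sketch shows how a mean corruption error is typically computed. It assumes the usual ImageNet-C convention: per corruption, the model's top-1 errors summed over severities are divided by those of a baseline model (AlexNet in the original benchmark), and the ratios are averaged over corruptions. The function name and all error values below are hypothetical placeholders, not results from the paper.

```python
import numpy as np


def mean_corruption_error(model_err, baseline_err):
    """Return the mCE in percent.

    Both arguments have shape (num_corruptions, num_severities) and hold
    top-1 error rates. ASSUMPTION: the baseline errors come from the
    reference model used for normalization (AlexNet for ImageNet-C).
    """
    model_err = np.asarray(model_err, dtype=float)
    baseline_err = np.asarray(baseline_err, dtype=float)
    # Corruption error: summed model error relative to the baseline.
    ce = model_err.sum(axis=1) / baseline_err.sum(axis=1)
    return 100.0 * ce.mean()


# Hypothetical error rates: 2 corruptions, 3 severities each.
baseline = np.array([[0.60, 0.80, 0.90],
                     [0.50, 0.70, 0.80]])
model = 0.5 * baseline  # a model making half the baseline's errors

mce_same = mean_corruption_error(baseline, baseline)
mce_half = mean_corruption_error(model, baseline)
```

By construction, the baseline model scores 100% mCE, so values below 100 indicate fewer errors than the baseline on average.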

Paper Structure

This paper contains 64 sections, 5 theorems, 43 equations, 11 figures, 12 tables.

Key Result

Proposition 1

We denote the source statistics as $\mu_s,\sigma_s^2$, the true target statistics as $\mu_t,\sigma^2_t$ and the biased estimates of the target statistics as $\hat{\mu}_t,\hat{\sigma}_t^2$. For normalization, we take a convex combination of the source statistics and estimated target statistics as dis… The quantity $\chi^2_{1-\alpha/2, n-1}$ denotes the left tail value of a chi-square distribution wi…
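The proposition concerns a Wasserstein distance between Gaussian statistics. As a hedged illustration, the sketch below uses the standard closed form of the squared 2-Wasserstein distance between two univariate Gaussians, $(\mu_a-\mu_b)^2 + (\sigma_a-\sigma_b)^2$, applied per feature. Whether the paper uses exactly this per-feature form is an assumption here, and the function name is hypothetical.

```python
import numpy as np


def squared_w2_gaussian_1d(mu_a, var_a, mu_b, var_b):
    """Squared 2-Wasserstein distance between N(mu_a, var_a) and N(mu_b, var_b).

    Closed form for univariate Gaussians, evaluated elementwise so that
    arrays of per-channel statistics can be passed directly.
    """
    mu_a, mu_b = np.asarray(mu_a, float), np.asarray(mu_b, float)
    sd_a = np.sqrt(np.asarray(var_a, float))
    sd_b = np.sqrt(np.asarray(var_b, float))
    return (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2


# Identical statistics give zero distance.
d_same = squared_w2_gaussian_1d(0.0, 1.0, 0.0, 1.0)
# N(0, 1) versus N(3, 4): (0 - 3)^2 + (1 - 2)^2 = 10.
d_shift = squared_w2_gaussian_1d(0.0, 1.0, 3.0, 4.0)
```

Both a mean shift and a change in standard deviation contribute to the distance, which is why correcting only first- and second-order statistics can reduce it.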

Figures (11)

  • Figure 1: The Wasserstein metric between optimal source (IN) and target (IN-C) statistics correlates well with top-1 errors of (i) non-adapted models on IN-C, (ii) adapted models on IN-C, indicating that even after reducing covariate shift, the metric is predictive of the remaining source--target mismatch, and (iii) IN-C adapted models on IN, the reverse case of (i). Holdout corruptions can be used to get a linear estimate on the prediction error of test corruptions (tables). We depict input and downsample (iv) as well as bottleneck layers (v) and notice the largest shift in early and late downsampling layers. The metric is either averaged across layers (i--iii) or across corruptions (iv--v).
  • Figure 2: Batch size vs. performance trade-off for different natural image datasets with no covariate shift (IN, IN-V2), complex and shuffled covariate shift (ObjectNet), complex and systematic covariate shift (ImageNet-R). Straight black lines show baseline performance (no adaptation). ImageNet plotted for reference.
  • Figure 3: The bound suggests small optimal $N$ for most parameters (i) and qualitatively explains our empirical observation (ii).
  • Figure 4: Wasserstein distance, normalized Wasserstein distance and Jeffrey divergence estimated among source and target statistics between different network layers. We report the respective metric w.r.t. the difference between baseline (IN) and target (IN-C) statistics and show the value averaged across all corruptions. We note that for a ResNet-50 model, downsampling layers contribute most to the overall error.
  • Figure 5: Normalized Wasserstein distance and Jeffrey divergence across corruptions and layers in a ResNet-50.
  • ...and 6 more figures

Theorems & Definitions (8)

  • Definition 1: Covariate Shift, cf. sugiyama2012machine, schoelkopf2012causal
  • Proposition 1: Bounds on the expected value of the Wasserstein distance between target and combined estimated target and source statistics
  • Proposition 1: Bounds on the expected value of the Wasserstein distance between target and combined estimated target and source statistics
  • Lemma 1: Mean and variance of sample moments, following weisstein
  • Lemma 2: Holder's defect formula for concave functions in probabilistic notation, following becker2012variance
  • Lemma 3: Upper and lower bounds on the expectation value of $\Bar{\sigma}$
  • proof
  • proof