A Fine-Grained Analysis on Distribution Shift

Olivia Wiles, Sven Gowal, Florian Stimberg, Sylvestre-Alvise Rebuffi, Ira Ktena, Krishnamurthy Dvijotham, Taylan Cemgil

TL;DR

The paper presents a principled framework to study robustness to distribution shifts by decomposing data into latent factors and defining three core shifts (spurious correlation, low-data drift, unseen data shift) plus two conditions (label noise, dataset size). It benchmarks 19 methods spanning architectures, augmentations, domain generalization, adaptive strategies, and representation learning across six datasets, showing that pretraining and learned augmentations frequently aid generalization, though no single method is universally best across all shifts. The work emphasizes the need for fine-grained, context-aware evaluation and provides practical tips for practitioners while highlighting directions for future research in robust generalization. Its modular framework and extensive results aim to guide method selection and encourage extensible benchmarking in real-world settings.

Abstract

Robustness to distribution shifts is critical for deploying machine learning models in the real world. Despite this necessity, there has been little work in defining the underlying mechanisms that cause these shifts and evaluating the robustness of algorithms across multiple, different distribution shifts. To this end, we introduce a framework that enables fine-grained analysis of various distribution shifts. We provide a holistic analysis of current state-of-the-art methods by evaluating 19 distinct methods grouped into five categories across both synthetic and real-world datasets. Overall, we train more than 85K models. Our experimental framework can be easily extended to include new methods, shifts, and datasets. We find, unlike previous work (Gulrajani and Lopez-Paz, 2020), that progress has been made over a standard ERM baseline; in particular, pretraining and augmentations (learned or heuristic) offer large gains in many cases. However, the best methods are not consistent over different datasets and shifts.
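
To make the three shifts named in the TL;DR concrete, the following is a minimal toy sketch of how a training joint distribution over a label and a nuisance attribute might differ from a uniform test distribution. It is an illustration under stated assumptions, not the authors' code: the function names, the grid sizes, and the `leak` and `low_weight` parameters are all hypothetical.

```python
# Toy sketch (assumed, not from the paper's codebase): each function returns
# a grid p[label][attribute] of probabilities that sums to 1.

def normalize(grid):
    """Scale a grid of non-negative weights so that it sums to 1."""
    total = sum(sum(row) for row in grid)
    return [[v / total for v in row] for row in grid]

def iid(n_labels, n_attrs):
    """Test-time distribution: every (label, attribute) pair equally likely."""
    return normalize([[1.0] * n_attrs for _ in range(n_labels)])

def spurious_correlation(n, leak=0.0):
    """Label i mostly co-occurs with attribute i; `leak` (hypothetical knob)
    is the relative weight given to uncorrelated pairs."""
    return normalize([[1.0 if i == j else leak for j in range(n)]
                      for i in range(n)])

def low_data_drift(n_labels, n_attrs, low_attrs, low_weight=0.05):
    """Attribute values in `low_attrs` are rare (but present) in training."""
    return normalize([[low_weight if j in low_attrs else 1.0
                       for j in range(n_attrs)]
                      for _ in range(n_labels)])

def unseen_data_shift(n_labels, n_attrs, unseen_attrs):
    """Attribute values in `unseen_attrs` never appear in training."""
    return low_data_drift(n_labels, n_attrs, unseen_attrs, low_weight=0.0)

# Example: 3 labels (e.g. shapes) and 3 attribute values (e.g. colors).
p_test = iid(3, 3)
p_sc = spurious_correlation(3)
p_ldd = low_data_drift(3, 3, low_attrs={2})
p_uds = unseen_data_shift(3, 3, unseen_attrs={2})
```

In this sketch, unseen data shift is treated as the limiting case of low-data drift in which the rare attribute values receive zero training mass; that framing is a simplification chosen for brevity.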

Paper Structure

This paper contains 85 sections, 4 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Visualization of the joint distribution for the different shifts we consider on the dSprites example. The lighter the color, the more likely the given sample. The SC, LDD, and UDS panels visualise the corresponding shifts over $p_{\rm train}(y^{l}, y^{a})$ discussed in the section on shifts: spurious correlation (SC), low-data drift (LDD), and unseen data shift (UDS). The IID panel visualises the test set, where the attributes are uniformly distributed.
  • Figure 2: Dataset samples. Each row fixes an attribute (e.g. color for dSprites, MPI3D, Shapes3D; azimuth for SmallNorb; hospital for Camelyon17; and location for iWildCam).
  • Figure 3: Spurious Correlation. We use all correlated samples and vary the number of samples $N$ from the true, uncorrelated distribution. We plot the percentage change over the baseline ResNet, averaged over all seeds and datasets. Blue is better, red worse. CycleGAN performs consistently best, while ImageNet augmentation and pretraining on ImageNet also consistently boost performance.
  • Figure 4: Low-data drift. We use all samples from the high data regions and vary the number of samples $N$ from the low-data region. We plot the percentage change over the baseline ResNet, averaged over all seeds and datasets. Blue is better, red worse. Pretraining on ImageNet performs consistently best, while CycleGAN, most domain generalization methods and ImageNet augmentation also provide some boost in performance.
  • Figure 5: Unseen data shift. We rank the methods (where best is $1$, worst $19$) for each dataset and seed and plot the rankings, with the overall median rank as the black bar. Pretraining on ImageNet and ImageNet augmentation perform consistently best. DANN, CycleGAN and other heuristic augmentations perform consistently well.
  • ...and 13 more figures
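
The captions above summarise results in two ways: as a percentage change over the baseline ResNet (Figures 3 and 4) and as per-run rankings with an overall median rank (Figure 5). The sketch below shows one plausible way to compute both. It assumes the percentage change is relative to the baseline accuracy, which the captions do not state explicitly, and it breaks ties by insertion order. The function names and all accuracy values are hypothetical placeholders, not results from the paper.

```python
from statistics import median

def percent_change(acc, baseline_acc):
    """Relative change of `acc` over `baseline_acc`, in percent (assumed
    definition; the captions do not say relative vs. absolute)."""
    return 100.0 * (acc - baseline_acc) / baseline_acc

def rank_methods(accs):
    """Map each method to its rank for one run; rank 1 is the most accurate."""
    ordered = sorted(accs, key=accs.get, reverse=True)
    return {method: rank for rank, method in enumerate(ordered, start=1)}

def median_ranks(runs):
    """Median rank of each method across runs (one run per dataset and seed)."""
    per_method = {}
    for accs in runs:
        for method, rank in rank_methods(accs).items():
            per_method.setdefault(method, []).append(rank)
    return {method: median(ranks) for method, ranks in per_method.items()}

# Hypothetical accuracies for three runs; placeholders only.
runs = [
    {"baseline": 0.50, "pretrained": 0.70, "augmented": 0.60},
    {"baseline": 0.40, "pretrained": 0.60, "augmented": 0.65},
    {"baseline": 0.55, "pretrained": 0.80, "augmented": 0.60},
]
summary = median_ranks(runs)
```

Ranking per dataset and seed before aggregating makes the summary insensitive to the very different accuracy scales across datasets, which is presumably why a rank plot is used for the unseen data shift comparison.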