An Analysis of Model Robustness across Concurrent Distribution Shifts

Myeongho Jeon, Suhwan Choi, Hyoje Lee, Teresa Yeo

TL;DR

This work studies robustness under complex, concurrent distribution shifts (ConDS). It introduces an attribute-driven evaluation framework that combines multiple single-shift (UniDS) types, namely spurious correlation (SC), low data drift (LDD), and unseen data shift (UDS), across multi-attribute datasets. The study covers 26 algorithms over 168 source–target pairs drawn from synthetic and real-world datasets, and finds that ConDS typically worsen performance relative to UniDS, yet improvements on one distribution shift (DS) often carry over to others. Heuristic data augmentations and pretraining consistently bolster robustness, while large foundation models in zero-shot settings are highly sensitive to prompts and underperform on real-world DSs. The framework and results are intended to inform the design of systems that remain reliable under realistic, multi-faceted distribution shifts.

Abstract

Machine learning models, meticulously optimized for source data, often fail to predict target data when faced with distribution shifts (DSs). Previous benchmarking studies, though extensive, have mainly focused on simple DSs. Recognizing that DSs often occur in more complex forms in real-world scenarios, we broadened our study to include multiple concurrent shifts, such as unseen domain shifts combined with spurious correlations. We evaluated 26 algorithms that range from simple heuristic augmentations to zero-shot inference using foundation models, across 168 source-target pairs from eight datasets. Our analysis of over 100K models reveals that (i) concurrent DSs typically worsen performance compared to a single shift, with certain exceptions, (ii) if a model improves generalization for one distribution shift, it tends to be effective for others, and (iii) heuristic data augmentations achieve the best overall performance on both synthetic and real-world datasets.

Paper Structure (38 sections, 3 equations, 40 figures, 16 tables)

Figures (40)

  • Figure 1: Concurrent distribution shifts. Left: We list some attributes of a few images from the dSprites dataset. In this dataset, the object shape is the label. Center: We show how a single attribute, e.g., the background color, can be used to create different types of distribution shifts, namely spurious correlation (SC), where in this example the background color is correlated with the object shape, low data drift (LDD), and unseen data shift (UDS). We assume that the test data consists of images where all attribute instances are equally likely to appear, i.e., each image is generated by randomly selecting each attribute instance with equal probability. Right: As the real world consists of more complex shifts, we also use multiple attributes to create combinations of distribution shifts. In the examples above, we use the background and object color to create combinations of distribution shifts. The first shift is created using the background attribute and the second shift using the shape attribute, i.e., SC+UDS is created from a correlation between the background color and the shape (SC) and from using only a subset of colors for the shape attribute (UDS).
  • Figure 2: dSprites samples. Even in simple synthetic data, multiple attributes can potentially lead to various DSs. Visualizations for other datasets are included in Figure \ref{fig:dataset} in Section \ref{supp:dataset}.
  • Figure 3: Aggregate result on controlled datasets. We plot the change in accuracy compared to the base model, ResNet18, averaged across all seeds and controlled datasets with varying attributes. Blue indicates improved performance, while red indicates a decline. Each row is independent of the others. The models used for zero-shot inference were only used for evaluation, so they have the same absolute performance in each row. However, because we show their accuracies relative to ResNet18, their relative performance differs from row to row. We fine-tune all the algorithms and report their optimal results. Augmentation methods and zero-shot models perform well under the different types of shifts. We provide a breakdown of the accuracy for each algorithm and dataset in Section \ref{supp:results}.
  • Figure 4: Analysis of model robustness on distribution shifts. Left: Average performance of all generalization methods under different combinations of DSs. Right: Comparing the different generalization methods under an increasing number of DSs.
  • Figure 5: Samples of controlled datasets. We provide visualizations of some samples along with their attributes.
  • ...and 35 more figures
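
The attribute-driven construction described in the Figure 1 caption can be illustrated with a small sampler. This is only a minimal sketch and not the authors' code: the attribute values, the function names, and the `rare_weight` ratio are hypothetical, and actual images are replaced by tuples of attribute values. It shows one plausible way to draw source splits under SC, LDD, UDS, and the concurrent SC+UDS case, alongside a uniform test split.

```python
import random

# Hypothetical attribute values, chosen for illustration only.
SHAPES = ["square", "ellipse", "heart"]   # the label
COLORS = ["red", "green", "blue"]         # instances of a color attribute


def sample_uniform(n, rng):
    """Test-style split: every attribute instance is equally likely."""
    return [(rng.choice(SHAPES), rng.choice(COLORS)) for _ in range(n)]


def sample_sc(n, rng):
    """Spurious correlation: each shape always appears with one background color."""
    pairing = dict(zip(SHAPES, COLORS))
    shapes = [rng.choice(SHAPES) for _ in range(n)]
    return [(s, pairing[s]) for s in shapes]


def sample_ldd(n, rng, rare="blue", rare_weight=0.02):
    """Low data drift: one color is present but heavily under-represented.

    rare_weight is an assumed ratio, not a value taken from the paper.
    """
    weights = [rare_weight if c == rare else 1.0 for c in COLORS]
    return [
        (rng.choice(SHAPES), rng.choices(COLORS, weights=weights)[0])
        for _ in range(n)
    ]


def sample_uds(n, rng, unseen="blue"):
    """Unseen data shift: one color never appears in the source split."""
    seen = [c for c in COLORS if c != unseen]
    return [(rng.choice(SHAPES), rng.choice(seen)) for _ in range(n)]


def sample_sc_plus_uds(n, rng, unseen="blue"):
    """Concurrent shift: SC on the background color, UDS on the object color."""
    pairing = dict(zip(SHAPES, COLORS))
    seen = [c for c in COLORS if c != unseen]
    samples = []
    for _ in range(n):
        shape = rng.choice(SHAPES)
        # (label, background color, object color)
        samples.append((shape, pairing[shape], rng.choice(seen)))
    return samples
```

A source split would be drawn with one of the shifted samplers and a target split with `sample_uniform`, which mirrors the caption's assumption that all attribute instances are equally likely at test time.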