BREEDS: Benchmarks for Subpopulation Shift

Shibani Santurkar, Dimitris Tsipras, Aleksander Madry

TL;DR

Breeds introduces a controllable framework for benchmarking robustness to subpopulation shift by reusing existing class hierarchies to define training and test subpopulations that are disjoint. It systematically constructs Breeds tasks from ImageNet by calibrating WordNet to align with visual similarity, and validates shifts with human studies. Empirically, standard models exhibit large performance drops under subpopulation shifts, and common robustness interventions yield only modest improvements. The work provides a general, data-efficient methodology for evaluating distributional robustness and offers practical insights into how to better generalize to unseen subpopulations.

Abstract

We develop a methodology for assessing the robustness of models to subpopulation shift---specifically, their ability to generalize to novel data subpopulations that were not observed during training. Our approach leverages the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions. This enables us to synthesize realistic distribution shifts whose sources can be precisely controlled and characterized, within existing large-scale datasets. Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity. We then validate that the corresponding shifts are tractable by obtaining human baselines for them. Finally, we utilize these benchmarks to measure the sensitivity of standard model architectures as well as the effectiveness of off-the-shelf train-time robustness interventions. Code and data available at https://github.com/MadryLab/BREEDS-Benchmarks .
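The core idea, controlling which subpopulations appear in the training and test distributions, can be illustrated with a minimal sketch. It assumes a toy two-level hierarchy and a uniformly random split of each superclass's subclasses. The function name `split_subpopulations` and the toy class names are hypothetical and are not taken from the authors' released code.

```python
import random
from typing import Dict, List, Tuple


def split_subpopulations(
    hierarchy: Dict[str, List[str]],
    seed: int = 0,
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split each superclass's subclasses into disjoint source/target halves.

    `hierarchy` maps a superclass name to the list of its subclasses
    (subpopulations). The random split is an assumption of this sketch.
    """
    rng = random.Random(seed)
    source: Dict[str, List[str]] = {}
    target: Dict[str, List[str]] = {}
    for superclass, subclasses in hierarchy.items():
        shuffled = sorted(subclasses)  # sort first so the result depends only on the seed
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        source[superclass] = shuffled[:half]   # subpopulations seen during training
        target[superclass] = shuffled[half:]   # subpopulations held out for testing
    return source, target


# Hypothetical toy hierarchy: the task label is the superclass,
# while the original dataset labels act as subpopulation annotations.
toy_hierarchy = {
    "dog": ["beagle", "poodle", "husky", "pug"],
    "cat": ["tabby", "siamese", "persian", "sphynx"],
}
source, target = split_subpopulations(toy_hierarchy, seed=0)
for name in toy_hierarchy:
    print(name, "source:", source[name], "target:", target[name])
```

A model would then be trained to predict the superclass using only images from the source subclasses and evaluated on images from the target subclasses, so that every test subpopulation is unseen during training.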

Paper Structure

This paper contains 44 sections, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Illustration of our pipeline to create subpopulation shift benchmarks. Given a dataset, we first define superclasses by grouping semantically similar classes together to form a hierarchy. This allows us to treat the dataset labels as subpopulation annotations. Then, we construct a Breeds task of specified granularity (i.e., depth in the hierarchy) by posing the classification task in terms of superclasses at that depth and then partitioning their respective subpopulations into the source and target domains.
  • Figure 2: Sample images from random object categories for the Entity-13 and Living-17 tasks. For each task, the top and bottom row correspond to the source and target distributions respectively.
  • Figure 3: Human performance on (binary) Breeds tasks. Annotators are provided with labeled images from the source distribution for a pair of (undisclosed) superclasses, and asked to classify samples from the target domain ('T') into one of the two groups. As a baseline, we also measure annotator performance without subpopulation shift (i.e., on test images drawn from the source domain, 'S') and on equivalent tasks created via the original WordNet hierarchy (cf. Appendix \ref{app:mturk}). Across all tasks, annotators are fairly robust to subpopulation shift. Further, annotators consistently perform better on Breeds tasks than on those based on WordNet directly, indicating that our modified class hierarchy is indeed better calibrated for object recognition. (We discuss model performance in Section \ref{sec:eval}.)
  • Figure 4: Robustness of standard models to Breeds subpopulation shifts. For each of the four tasks, we plot the accuracy of different (source domain-trained) model architectures (denoted by different symbols) on the target domain as a function of the source accuracy (which is typically high). We find that model accuracy drops significantly between domains (orange vs. dashed line). Still, models that are more accurate on the source domain seem to also be more robust: the improvements exceed the baseline (grey), which would correspond to a constant accuracy drop across models, i.e., $\frac{\text{source acc}}{\text{target acc}} = \text{constant}$, based on AlexNet. Moreover, the drop in model performance on the target domain can be reduced by retraining the final model layer with data from that domain (green). However, a non-trivial drop persists compared to both the original source accuracy and the target accuracy of models trained directly (end-to-end) on the target domain (blue).
  • Figure 5: Effect of train-time interventions on model robustness to subpopulation shift. We measure model performance in terms of relative accuracy, i.e., the ratio between its target and source accuracies. This allows us to visualize the accuracy-robustness trade-off along with the corresponding Pareto frontier (dashed). (Also shown are 95% confidence intervals computed via bootstrapping.) We observe that some of these interventions (specifically, erase noise and adversarial training) do improve model robustness to subpopulation shift by a small amount, albeit sometimes at the cost of source accuracy.
  • ...and 8 more figures
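
The relative accuracy metric from the Figure 5 caption (target accuracy divided by source accuracy) and a bootstrapped 95% confidence interval can be sketched as follows. This is a minimal illustration under stated assumptions: the percentile bootstrap, the number of resamples, and the toy correctness vectors are choices made here, and the caption does not specify the authors' exact procedure.

```python
import random
from typing import List, Tuple


def accuracy(correct: List[bool]) -> float:
    """Fraction of correct predictions."""
    return sum(correct) / len(correct)


def relative_accuracy(source_correct: List[bool], target_correct: List[bool]) -> float:
    """Ratio of target-domain accuracy to source-domain accuracy."""
    return accuracy(target_correct) / accuracy(source_correct)


def bootstrap_ci(
    source_correct: List[bool],
    target_correct: List[bool],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap interval for relative accuracy (an assumed procedure)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        # Resample each domain's per-example outcomes with replacement.
        s = rng.choices(source_correct, k=len(source_correct))
        t = rng.choices(target_correct, k=len(target_correct))
        stats.append(accuracy(t) / accuracy(s))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-example outcomes: 90% source accuracy, 60% target accuracy.
source_correct = [True] * 180 + [False] * 20
target_correct = [True] * 120 + [False] * 80

rel = relative_accuracy(source_correct, target_correct)
low, high = bootstrap_ci(source_correct, target_correct)
print(f"relative accuracy = {rel:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

A relative accuracy of 1 would mean no drop under subpopulation shift. Reporting the ratio alongside source accuracy is what makes the accuracy-robustness trade-off visible.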