BREEDS: Benchmarks for Subpopulation Shift
Shibani Santurkar, Dimitris Tsipras, Aleksander Madry
TL;DR
BREEDS is a controllable framework for benchmarking robustness to subpopulation shift. It reuses the class hierarchies of existing datasets to define training and test distributions composed of disjoint subpopulations. The authors construct BREEDS tasks from ImageNet by calibrating the WordNet hierarchy to better align with visual similarity, and they validate the resulting shifts with human studies. Empirically, standard models exhibit large performance drops under subpopulation shift, and common robustness interventions yield only modest improvements. The methodology is general and data-efficient, since it synthesizes shifts within existing datasets, and the results offer practical insight into generalizing to unseen subpopulations.
Abstract
We develop a methodology for assessing the robustness of models to subpopulation shift: specifically, their ability to generalize to novel data subpopulations that were not observed during training. Our approach leverages the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions. This enables us to synthesize realistic distribution shifts whose sources can be precisely controlled and characterized, within existing large-scale datasets. Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity. We then validate that the corresponding shifts are tractable by obtaining human baselines for them. Finally, we utilize these benchmarks to measure the sensitivity of standard model architectures as well as the effectiveness of off-the-shelf train-time robustness interventions. Code and data available at https://github.com/MadryLab/BREEDS-Benchmarks.
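As a rough illustration of the idea in the abstract, the sketch below partitions the subclasses of each superclass into disjoint source (training) and target (test) sets, so that both domains share the same labels but contain different subpopulations. It is a minimal sketch, not the released BREEDS code: the function name `split_subpopulations`, the toy hierarchy, and the use of a uniformly random half-and-half split are all assumptions made for illustration.

```python
import random


def split_subpopulations(hierarchy, seed=0):
    """Split each superclass's subclasses into disjoint source/target sets.

    `hierarchy` maps a superclass name to a list of its subclass names.
    The random half-and-half split is an assumption of this sketch.
    """
    rng = random.Random(seed)
    source, target = {}, {}
    for superclass, subclasses in hierarchy.items():
        shuffled = sorted(subclasses)  # sort first so the result depends only on the seed
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        source[superclass] = shuffled[:half]   # subpopulations seen during training
        target[superclass] = shuffled[half:]   # novel subpopulations used for testing
    return source, target


# Hypothetical two-level hierarchy; a real task would derive this from the dataset's classes.
toy_hierarchy = {
    "dog": ["beagle", "poodle", "husky", "terrier"],
    "cat": ["tabby", "siamese", "persian", "lynx"],
}

source, target = split_subpopulations(toy_hierarchy, seed=0)
for superclass in toy_hierarchy:
    print(superclass, "| train on:", source[superclass], "| test on:", target[superclass])
```

A model would then be trained to predict the superclass from images of the source subclasses and evaluated on images of the target subclasses; the gap between the two accuracies gives one measure of sensitivity to the shift.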
