
Challenges and Opportunities in Improving Worst-Group Generalization in Presence of Spurious Features

Siddharth Joshi, Yu Yang, Yihao Xue, Wenhan Yang, Baharan Mirzasoleiman

TL;DR

This work tackles worst-group generalization under spurious correlations in settings with multiple classes and many groups, and with spurious features that may be slow to learn. It introduces SpuCoSun and SpuCoAnimals to stress-test group-inference and robust-training methods across 5 vision datasets, evaluating 8 SOTA approaches with over 5K models. The study reveals that existing GI methods deteriorate when spurious features are learned slowly, and that increasing the number of groups or classes exacerbates accuracy disparities and degrades average performance, while model selection becomes crucial and costly. It also proposes a cost-efficient model-selection strategy for group-inference methods by using inferred-group quality as a proxy, potentially enabling scalable evaluation and deployment of WG-robust approaches. The results chart a path toward more robust worst-group generalization under complex spurious correlations and offer valuable benchmarks and practical guidelines for future work.

Abstract

Deep neural networks often exploit *spurious* features that are present in the majority of examples within a class during training. This leads to *poor worst-group test accuracy*, i.e., poor accuracy for minority groups that lack these spurious features. Despite the growing body of recent efforts to address spurious correlations (SC), several challenging settings remain unexplored. In this work, we propose studying methods to mitigate SC in settings with: 1) spurious features that are learned more slowly, 2) a larger number of classes, and 3) a larger number of groups. We introduce two new datasets, Animals and SUN, to facilitate this study and conduct a systematic benchmarking of 8 state-of-the-art (SOTA) methods across a total of 5 vision datasets, training over 5,000 models. Through this, we highlight how existing group inference methods struggle in the presence of spurious features that are learned later in training. Additionally, we demonstrate how all existing methods struggle in settings with more groups and/or classes. Finally, we show the importance of careful model selection (hyperparameter tuning) in extracting optimal performance, especially in the more challenging settings we introduced, and propose more cost-efficient strategies for model selection. Overall, through extensive and systematic experiments, this work uncovers a suite of new challenges and opportunities for improving worst-group generalization in the presence of spurious features. Our datasets, methods and scripts are available at https://github.com/BigML-CS-UCLA/SpuCo.
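To make the worst-group metric concrete, the following is a minimal sketch of how worst-group (WG) and average (AVG) accuracy might be computed from predictions. It is not taken from the SpuCo codebase: the function names, the representation of a group as a (class, spurious feature) pair, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def group_accuracies(y_true, y_pred, groups):
    """Return a dict mapping each group id to the accuracy on that group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(y_true, y_pred, groups):
    """Worst-group accuracy: the minimum per-group accuracy."""
    return min(group_accuracies(y_true, y_pred, groups).values())


def average_accuracy(y_true, y_pred):
    """Average accuracy over all examples, ignoring group membership."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: each group is a (class, spurious feature) pair.
# Majority groups: dogs with collars, cats without collars.
y_true = ["dog"] * 5 + ["cat"] * 5
groups = (
    [("dog", "collar")] * 4 + [("dog", "no_collar")]
    + [("cat", "no_collar")] * 4 + [("cat", "collar")]
)
# A classifier that predicts from the collar alone is right on the
# majority groups and wrong on both minority groups.
y_pred = ["dog" if spurious == "collar" else "cat" for _, spurious in groups]

avg = average_accuracy(y_true, y_pred)
wg = worst_group_accuracy(y_true, y_pred, groups)
```

In this toy case the collar-only classifier is correct on 8 of 10 examples yet gets every minority-group example wrong, which is the kind of gap between average and worst-group accuracy that the abstract describes.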

Paper Structure (37 sections, 7 equations, 18 figures, 16 tables)

This paper contains 37 sections, 7 equations, 18 figures, 16 tables.

Figures (18)

  • Figure 1: Larger accuracy disparity is worse. 1) Slow-learnable spurious features are more challenging for group inference methods. 2) More groups are more challenging for all methods. 3) More classes make it challenging to maintain high AVG while improving WG accuracy (seen as a strong negative correlation between AVG and WG and a large spread across methods).
  • Figure 2: Our Setting (Top): In spurious correlations, certain features (e.g., collars) are spuriously correlated with specific classes (e.g., dogs). For example, the majority of dogs appear with collars and the majority of cats without, hence collars are spuriously correlated with dogs. Bottom: In domain generalization, we train on one domain (e.g., real images) and test on unseen domains (e.g., cartoon images).
  • Figure 3: Comparing Average Accuracies of Majority and Minority Groups on SpuCoSun (Fast) vs. SpuCoSun
  • Figure 4: (Left) SpuCoSun (Fast) vs. SpuCoSun spurious feature. (Right) Both lead to poor WG accuracy with ERM, but group inference methods lag behind in improving WG accuracy for slow-learnable spurious features.
  • Figure 5: Large Fraction of Minority Inferred as Majority during Group Inference with Slow-Learnable Spurious Features
  • ...and 13 more figures