An Analysis of Model Robustness across Concurrent Distribution Shifts
Myeongho Jeon, Suhwan Choi, Hyoje Lee, Teresa Yeo
TL;DR
This work studies robustness under concurrent distribution shifts (ConDS) by introducing an attribute-driven evaluation framework that combines several single distribution shift (UniDS) types, namely spurious correlation (SC), low data drift (LDD), and unseen domain shift (UDS), on multi-attribute datasets. It evaluates 26 algorithms on 168 source–target pairs drawn from synthetic and real-world datasets and finds that ConDS typically degrade performance more than UniDS, yet a method that improves robustness to one shift tends to help on others. Heuristic data augmentations and pretraining consistently improve robustness, whereas large foundation models used zero-shot are highly sensitive to prompts and underperform on real-world shifts. The framework and results offer practical guidance for building systems that remain reliable under realistic, multi-faceted distribution shifts.
Abstract
Machine learning models, meticulously optimized for source data, often fail to predict target data when faced with distribution shifts (DSs). Previous benchmarking studies, though extensive, have mainly focused on simple DSs. Recognizing that DSs often occur in more complex forms in real-world scenarios, we broadened our study to include multiple concurrent shifts, such as unseen domain shifts combined with spurious correlations. We evaluated 26 algorithms that range from simple heuristic augmentations to zero-shot inference using foundation models, across 168 source-target pairs from eight datasets. Our analysis of over 100K models reveals that (i) concurrent DSs typically worsen performance compared to a single shift, with certain exceptions, (ii) if a model improves generalization for one distribution shift, it tends to be effective for others, and (iii) heuristic data augmentations achieve the best overall performance on both synthetic and real-world datasets.
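To make the notion of a concurrent shift concrete, the following is a minimal sketch of how a source–target pair combining a spurious correlation with an unseen domain shift could be carved out of a multi-attribute dataset. It is an illustration under stated assumptions, not the authors' code: the attribute names (`label`, `color`, `domain`), the toy data grid, and the helper functions `build_grid` and `split_sc_uds` are all hypothetical.

```python
from itertools import product


def build_grid(n_per_cell=2):
    """Toy multi-attribute dataset (hypothetical): every combination of
    label, color, and domain appears n_per_cell times."""
    samples = []
    for label, color, domain in product([0, 1], [0, 1], ["photo", "sketch", "paint"]):
        for _ in range(n_per_cell):
            samples.append({"label": label, "color": color, "domain": domain})
    return samples


def split_sc_uds(samples, unseen_domain):
    """Build a source/target pair with two concurrent shifts.

    Spurious correlation: in the source, color always equals the label.
    Unseen domain shift: the source never contains `unseen_domain`.
    The target is every sample from the unseen domain, where color and
    label are no longer tied together.
    """
    source = [
        s for s in samples
        if s["color"] == s["label"] and s["domain"] != unseen_domain
    ]
    target = [s for s in samples if s["domain"] == unseen_domain]
    return source, target


samples = build_grid()
source, target = split_sc_uds(samples, unseen_domain="paint")

# Fraction of samples in which the spurious attribute agrees with the label.
source_agreement = sum(s["color"] == s["label"] for s in source) / len(source)
target_agreement = sum(s["color"] == s["label"] for s in target) / len(target)
print(len(source), len(target), source_agreement, target_agreement)
```

In this toy setup a model trained on `source` can predict the label perfectly from `color` alone, and it has never seen the held-out domain. Both shortcuts fail at once on `target`, which is the situation a single-shift benchmark does not capture.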
