Evaluating Model Performance Under Worst-case Subpopulations
Mike Li, Daksh Mittal, Hongseok Namkoong, Shangzhou Xia
TL;DR
The paper defines a model's worst-case performance over all subpopulations of a given size, where subpopulations are specified by core attributes $Z$, and uses it to certify distributional robustness before deployment. It develops a scalable two-stage estimation framework: a debiased, cross-fitted estimator built on a dual reformulation that keeps tail-risk evaluation tractable even when $Z$ is high-dimensional. The authors provide asymptotic and finite-sample guarantees, including dimension-free concentration results, and validate the approach through simulations and case studies (Warfarin, ACS Income, FMoW). These show that the method can certify robustness and flag unreliable models, while remaining limited for shifts that fall truly outside the support of the evaluation data. The authors also connect the evaluation framework to coherent risk measures and distributionally robust optimization, and discuss implications for model assessment and data collection in practice.
Abstract
The performance of ML models degrades when the training population is different from that seen under operation. Towards assessing distributional robustness, we study the worst-case performance of a model over all subpopulations of a given size, defined with respect to core attributes $Z$. This notion of robustness can consider arbitrary (continuous) attributes $Z$, and automatically accounts for complex intersectionality in disadvantaged groups. We develop a scalable yet principled two-stage estimation procedure that can evaluate the robustness of state-of-the-art models. We prove that our procedure enjoys several finite-sample convergence guarantees, including dimension-free convergence. Instead of overly conservative notions based on Rademacher complexities, our evaluation error depends on the dimension of $Z$ only through the out-of-sample error in estimating the performance conditional on $Z$. On real datasets, we demonstrate that our method certifies the robustness of a model and prevents deployment of unreliable models.
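
As a rough illustration of the two-stage idea, the Python sketch below first estimates the conditional loss given the core attributes on held-out folds, then evaluates the worst case over subpopulations of proportion at least alpha. It is a simplified plug-in version and not the authors' estimator: it omits the debiasing correction, and the function names, the k-nearest-neighbor regressor, and the pooling of predictions across folds are all assumptions made for illustration. It also assumes the worst case can be computed through the standard dual form of the conditional value-at-risk, inf over eta of eta + E[(mu(Z) - eta)_+] / alpha, where mu(Z) denotes the expected loss conditional on Z.

```python
import random
from statistics import fmean


def cvar_upper_tail(values, alpha):
    """Worst-case mean of `values` over reweightings that keep at least an
    alpha-fraction of the mass: inf_eta { eta + mean((v - eta)_+) / alpha }."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(values)
    desc = sorted(values, reverse=True)
    # The dual objective is convex and piecewise linear in eta; a minimiser is
    # the (floor(alpha * n) + 1)-th largest value (clamped to the smallest one).
    eta = desc[min(int(alpha * n), n - 1)]
    return eta + sum(max(v - eta, 0.0) for v in values) / (alpha * n)


def knn_regressor(Z_train, y_train, k=5):
    """Hypothetical stand-in for stage 1: any regressor for E[loss | Z] would do."""
    def predict(z):
        # Squared Euclidean distance from z to every training point.
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(z, zt)), y)
            for zt, y in zip(Z_train, y_train)
        )
        return fmean([y for _, y in scored[:k]])
    return predict


def worst_case_subpop_loss(Z, losses, alpha, n_folds=2, fit=knn_regressor, seed=0):
    """Plug-in, cross-fitted estimate of the worst-case subpopulation loss."""
    n = len(Z)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[j::n_folds] for j in range(n_folds)]
    mu_hat = [0.0] * n
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Stage 1: fit the conditional-loss model on the other folds only.
        predict = fit([Z[i] for i in train], [losses[i] for i in train])
        for i in held_out:
            mu_hat[i] = predict(Z[i])
    # Stage 2: tail average of the out-of-fold predictions (assumption: pooled
    # across folds rather than computed per fold and averaged).
    return cvar_upper_tail(mu_hat, alpha)
```

Here `Z` is a list of attribute tuples and `losses` holds the model's per-example loss. Calling `worst_case_subpop_loss(Z, losses, alpha=0.2)` gives the plug-in estimate for subpopulations covering at least 20% of the data, and `alpha=1.0` recovers the average of the fitted conditional losses. Because the second stage only sees the scalar predictions, the dimension of `Z` enters solely through the quality of the stage-1 regressor.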
