Evaluating Model Performance Under Worst-case Subpopulations

Mike Li; Daksh Mittal; Hongseok Namkoong; Shangzhou Xia

TL;DR

The paper defines worst-case subpopulation performance over core attributes $Z$ to certify distributional robustness before deployment. It develops a scalable two-stage estimation framework with a debiased, cross-fitted estimator based on a dual reformulation that makes tail-risk evaluation tractable even for high-dimensional $Z$. The authors provide asymptotic and finite-sample guarantees, including dimension-free concentration results, and validate the approach through simulations and case studies (Warfarin, ACS Income, FMoW), showing it can certify robustness and flag unreliable models while noting limitations for truly out-of-support shifts. They further connect this evaluation framework to coherent risk measures and distributionally robust optimization, highlighting implications for model assessment and data collection in practice.

Abstract

The performance of ML models degrades when the training population is different from that seen under operation. Towards assessing distributional robustness, we study the worst-case performance of a model over all subpopulations of a given size, defined with respect to core attributes Z. This notion of robustness can consider arbitrary (continuous) attributes Z, and automatically accounts for complex intersectionality in disadvantaged groups. We develop a scalable yet principled two-stage estimation procedure that can evaluate the robustness of state-of-the-art models. We prove that our procedure enjoys several finite-sample convergence guarantees, including dimension-free convergence. Instead of overly conservative notions based on Rademacher complexities, our evaluation error depends on the dimension of Z only through the out-of-sample error in estimating the performance conditional on Z. On real datasets, we demonstrate that our method certifies the robustness of a model and prevents deployment of unreliable models.
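The two-stage idea described above can be illustrated with a minimal sketch, under stated assumptions. Stage 1 fits a model of the loss conditional on $Z$ on one fold; stage 2 plugs the held-out predictions into the standard dual (tail-average) formula. This is only a cross-fitted plug-in illustration, not the paper's debiased estimator. The names `fit_binned_mean`, `dual_value`, and `cross_fitted_plugin`, the one-dimensional $Z$, the binned-average learner, and the synthetic data are all hypothetical choices made for the example.

```python
import random

def fit_binned_mean(z, loss, n_bins=10):
    """Stage 1 (stand-in learner, an assumption): average loss within
    equal-width bins of a scalar z in [0, 1)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for zi, li in zip(z, loss):
        b = min(int(zi * n_bins), n_bins - 1)
        sums[b] += li
        counts[b] += 1
    overall = sum(loss) / len(loss)  # fallback for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda zi: means[min(int(zi * n_bins), n_bins - 1)]

def dual_value(values, alpha):
    """Stage 2: minimize eta + mean((v - eta)_+) / alpha over eta.
    The objective is convex and piecewise linear in eta, so it is
    enough to search over the observed values."""
    n = len(values)
    return min(
        eta + sum(max(v - eta, 0.0) for v in values) / (alpha * n)
        for eta in set(values)
    )

def cross_fitted_plugin(z, loss, alpha, n_folds=2):
    """Fit on the other folds, evaluate on the held-out fold, then average."""
    n = len(z)
    estimates = []
    for k in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == k]
        train = [i for i in range(n) if i % n_folds != k]
        h_hat = fit_binned_mean([z[i] for i in train], [loss[i] for i in train])
        estimates.append(dual_value([h_hat(z[i]) for i in held_out], alpha))
    return sum(estimates) / n_folds

random.seed(0)
n = 2000
z = [random.random() for _ in range(n)]
# Synthetic loss (an assumption) with conditional mean E[loss | z] = z.
loss = [zi + random.gauss(0.0, 0.1) for zi in z]

average_loss = sum(loss) / n
worst_case = cross_fitted_plugin(z, loss, alpha=0.2)
print(f"average loss: {average_loss:.3f}, worst-case (alpha=0.2): {worst_case:.3f}")
```

In this synthetic setup the conditional risk equals $Z$, so the population worst-case value over subpopulations of size 0.2 is $\mathbb{E}[Z \mid Z \ge 0.8] = 0.9$, well above the average loss of 0.5. The sketch should land near that value, up to binning and sampling error.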

Paper Structure (55 sections, 26 theorems, 146 equations, 21 figures, 6 tables, 1 algorithm)

Introduction
Related work.
Methodology
Dual reformulation
Estimation
Plug-in estimator
Debiasing the plug-in estimator
Cross-fitting procedure
Empirical comparison between plug-in and debiased estimators
Comparison with SubbaswamyAdSa21
Convergence guarantees
Asymptotics
Concentration using the localized Rademacher complexity
Data-dependent dimension-free concentration
Extensions for heavy-tailed loss functions
...and 40 more sections

Key Result

Lemma 1

If $\mathbb{E}[h(Z)_+] < \infty$, then for $\alpha\in(0,1)$, $\mathsf{W}_{\alpha}(h) = \inf_{\eta \in \mathbb{R}} \left\{ \frac{1}{\alpha}\,\mathbb{E}\left[(h(Z) - \eta)_+\right] + \eta \right\}$. The infimum is attained at $\eta = P_{1-\alpha}^{-1}(h)$. Moreover, if $h(Z)$ has no probability mass at $P_{1-\alpha}^{-1}(h)$, then $\mathsf{W}_{\alpha}(h) = \mathbb{E}[h(Z)\,|\, h(Z) \ge P_{1-\alpha}^{-1}(h)]$.
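As a small numerical check of the lemma, the sketch below compares two ways of computing the worst-case value on an empirical sample: minimizing the standard Rockafellar-Uryasev dual objective $\eta + \frac{1}{\alpha}\mathbb{E}[(h(Z) - \eta)_+]$ over $\eta$, and averaging the top $\alpha$ fraction of values directly. The function names and the Gaussian sample are illustrative assumptions, and $\alpha n$ is chosen to be an integer so that the two quantities coincide exactly on the sample.

```python
import random

def dual_objective(values, alpha, eta):
    """eta + (1/alpha) * empirical mean of (h - eta)_+."""
    n = len(values)
    return eta + sum(max(v - eta, 0.0) for v in values) / (alpha * n)

def worst_case_dual(values, alpha):
    """Minimize the dual objective over eta. It is convex and piecewise
    linear with kinks at the sample points, so searching those suffices."""
    return min(dual_objective(values, alpha, eta) for eta in values)

def worst_case_tail_mean(values, alpha):
    """Average of the largest alpha-fraction of values (alpha * n an integer)."""
    k = round(alpha * len(values))
    top = sorted(values, reverse=True)[:k]
    return sum(top) / k

random.seed(0)
h = [random.gauss(0.0, 1.0) for _ in range(200)]  # stand-in for h(Z_i)
alpha = 0.25                                      # alpha * n = 50

w_dual = worst_case_dual(h, alpha)
w_tail = worst_case_tail_mean(h, alpha)
print(f"dual: {w_dual:.6f}, tail mean: {w_tail:.6f}")
```

The two numbers agree up to floating-point error, and both exceed the plain sample mean, which corresponds to $\alpha = 1$.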

Figures (21)

Figure 1: Conditional risk $\mu(Z) = \mathbb{E}[(Y - \theta(X))^2 \mid Z]$. Here $Z$ = age on the left panel, $Z$ = race in the center, and $Z$ = (age, race) on the right. A = Asian, B = Black, U = Unknown, W = White.
Figure 2: The debiased estimator yields substantially lower MSE and rapidly gains relative efficiency as $n$ grows. Improvement is reported as $(\text{MSE}_\text{plug-in} - \text{MSE}_\text{debiased})/\text{MSE}_\text{plug-in}$.
Figure 3: Debiasing sharply reduces bias without inflating variance. Bias improvements exceed $2\times$ at $n=10^2$ and remain at least $6\times$ through $n = 10^5$, while the variance stays comparable to the plug-in baseline.
Figure 4: $\widehat{\mathsf{W}}_{\alpha, k}(\widehat{h}_1)$ and $\mathsf{W}_{\alpha}(\theta)$ from simulation experiments with $\alpha=0.3$
Figure 5: Worst-case subpopulation performance $\mathsf{W}_{\alpha}(\theta)$, where $\mathsf{W}_{1.0}(\theta) = \mathbb{E}[\ell(\theta(X); Y)]$. Results are averaged over 50 random seeds, with error bars showing 95% confidence intervals over the random runs.
...and 16 more figures

Theorems & Definitions (30)

Lemma 1: ShapiroDeRu14 and RockafellarUr00
Theorem 1
Theorem 2
Theorem 3
Theorem 4: Mendelson14, Theorem 3.1
Lemma 2: Mendelson14, Theorem 4.6
Definition 1: ShapiroDeRu14
Lemma 3: ShapiroDeRu14
Theorem 5
Proposition 6
...and 20 more
