Evaluating Model Performance Under Worst-case Subpopulations

Mike Li, Daksh Mittal, Hongseok Namkoong, Shangzhou Xia

TL;DR

The paper defines worst-case subpopulation performance over core attributes $Z$ to certify distributional robustness before deployment. It develops a scalable two-stage estimation framework with a debiased, cross-fitted estimator based on a dual reformulation that makes tail-risk evaluation tractable even for high-dimensional $Z$. The authors provide asymptotic and finite-sample guarantees, including dimension-free concentration results, and validate the approach through simulations and case studies (Warfarin, ACS Income, FMoW), showing it can certify robustness and flag unreliable models while noting limitations for truly out-of-support shifts. They further connect this evaluation framework to coherent risk measures and distributionally robust optimization, highlighting implications for model assessment and data collection in practice.

Abstract

The performance of ML models degrades when the training population is different from that seen under operation. Towards assessing distributional robustness, we study the worst-case performance of a model over all subpopulations of a given size, defined with respect to core attributes Z. This notion of robustness can consider arbitrary (continuous) attributes Z, and automatically accounts for complex intersectionality in disadvantaged groups. We develop a scalable yet principled two-stage estimation procedure that can evaluate the robustness of state-of-the-art models. We prove that our procedure enjoys several finite-sample convergence guarantees, including dimension-free convergence. Instead of overly conservative notions based on Rademacher complexities, our evaluation error depends on the dimension of Z only through the out-of-sample error in estimating the performance conditional on Z. On real datasets, we demonstrate that our method certifies the robustness of a model and prevents deployment of unreliable models.
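The two-stage idea in the abstract can be illustrated with a minimal plug-in sketch: stage one estimates the conditional loss given $Z$ on one half of the data, and stage two averages the largest $\alpha$-fraction of the fitted values on the other half. Everything named here is an assumption for illustration: the binned-mean regressor `fit_binned_mean`, the helper names, the scalar $Z$ in $[0, 1)$, and the synthetic data. The sketch omits the debiasing correction and full cross-fitting that the paper's estimator uses, so it should not be read as the authors' procedure.

```python
import random


def fit_binned_mean(z, loss, n_bins=10):
    """Hypothetical first-stage model: mean loss within equal-width bins of z in [0, 1)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for zi, li in zip(z, loss):
        b = min(int(zi * n_bins), n_bins - 1)
        sums[b] += li
        counts[b] += 1
    overall = sum(loss) / len(loss)  # fallback for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda zi: means[min(int(zi * n_bins), n_bins - 1)]


def tail_average(values, alpha):
    """Average of the largest alpha-fraction of values."""
    k = max(1, round(alpha * len(values)))
    return sum(sorted(values, reverse=True)[:k]) / k


def two_stage_plugin(z, loss, alpha, seed=0):
    """Stage 1: fit conditional loss on one half. Stage 2: tail-average on the other half."""
    idx = list(range(len(z)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    fit_idx, eval_idx = idx[:half], idx[half:]
    h_hat = fit_binned_mean([z[i] for i in fit_idx], [loss[i] for i in fit_idx])
    return tail_average([h_hat(z[i]) for i in eval_idx], alpha)


# Synthetic data (assumption): the conditional loss given Z = z equals z.
rng = random.Random(1)
z = [rng.random() for _ in range(4000)]
loss = [zi + rng.gauss(0, 0.5) for zi in z]

average_loss = sum(loss) / len(loss)
worst_case = two_stage_plugin(z, loss, alpha=0.2)
print(f"average loss: {average_loss:.3f}, worst-case (alpha=0.2): {worst_case:.3f}")
```

In this toy setup the worst 20% subpopulation is the region with the largest $Z$, so the worst-case estimate sits well above the average loss. Tail-averaging the fitted conditional losses rather than the raw losses is what restricts the worst case to subpopulations defined by $Z$.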

Paper Structure

This paper contains 55 sections, 26 theorems, 146 equations, 21 figures, 6 tables, 1 algorithm.

Key Result

Lemma 1

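If $\mathbb{E}[h(Z)_+] < \infty$, then for $\alpha\in(0,1)$, $\mathsf{W}_{\alpha}(h) = \inf_{\eta \in \mathbb{R}} \left\{ \frac{1}{\alpha}\mathbb{E}[(h(Z) - \eta)_+] + \eta \right\}$. The infimum is attained at $\eta = P_{1-\alpha}^{-1}(h)$, the $(1-\alpha)$-quantile of $h(Z)$. Moreover, if $h(Z)$ has no probability mass at $P_{1-\alpha}^{-1}(h)$, then $\mathsf{W}_{\alpha}(h) = \mathbb{E}[h(Z)\,|\, h(Z) \ge P_{1-\alpha}^{-1}(h)]$.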
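To make the lemma concrete, the following sketch evaluates both characterizations on an empirical distribution: the dual form (an infimum over $\eta$ of $\eta + \frac{1}{\alpha}\mathbb{E}[(h(Z) - \eta)_+]$) and the tail-average form. The function names and the toy values are assumptions for illustration. The two agree here because $\alpha n$ is an integer; with ties or a fractional $\alpha n$ the simple tail average is only an approximation.

```python
def worst_case_dual(values, alpha):
    """Dual form: minimize eta + mean((v - eta)_+) / alpha over eta.

    The empirical objective is convex and piecewise linear with kinks at the
    sample points, so it is enough to search over the sample values.
    """
    n = len(values)

    def objective(eta):
        return eta + sum(max(v - eta, 0.0) for v in values) / (alpha * n)

    return min(objective(eta) for eta in values)


def worst_case_tail_average(values, alpha):
    """Tail-average form: mean of the largest alpha * n values (alpha * n assumed integer)."""
    k = round(alpha * len(values))
    return sum(sorted(values, reverse=True)[:k]) / k


# Toy values standing in for h(Z_1), ..., h(Z_n) (assumption).
h_values = [float(v) for v in range(1, 11)]
dual_value = worst_case_dual(h_values, alpha=0.3)
tail_value = worst_case_tail_average(h_values, alpha=0.3)
print(dual_value, tail_value)
```

Searching over sample points keeps the sketch exact and dependency-free at quadratic cost; for large samples, sorting once and reading off the empirical quantile gives the same minimizer.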

Figures (21)

  • Figure 1: Conditional risk $\mu(Z) = \mathbb{E}[(Y - \theta(X))^2 \mid Z]$. Here $Z$ = age on the left panel, $Z$ = race in the center, and $Z$ = (age, race) on the right. A = Asian, B = Black, U = Unknown, W = White.
  • Figure 2: The debiased estimator yields substantially lower MSE than the plug-in estimator and rapidly gains relative efficiency as $n$ grows. Improvement is reported as $(\text{MSE}_\text{plug-in} - \text{MSE}_\text{debiased})/\text{MSE}_\text{plug-in}$.
  • Figure 3: Debiasing sharply reduces bias without inflating variance. Bias improvements exceed $2\times$ at $n=10^2$ and remain at least $6\times$ through $n = 10^5$, while the variance stays comparable to the plug-in baseline.
  • Figure 4: $\widehat{\mathsf{W}}_{\alpha, k}(\widehat{h}_1)$ and $\mathsf{W}_{\alpha}(\theta)$ from simulation experiments with $\alpha=0.3$
  • Figure 5: Worst-case subpopulation performance $\mathsf{W}_{\alpha}(\theta)$, where $\mathsf{W}_{1.0}(\theta) = \mathbb{E}[\ell(\theta(X); Y)]$. Results are averaged over 50 random seeds, with error bars showing 95% confidence intervals over the random runs.
  • ...and 16 more figures

Theorems & Definitions (30)

  • Lemma 1: ShapiroDeRu14 and RockafellarUr00
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4: Mendelson14, Theorem 3.1
  • Lemma 2: Mendelson14, Theorem 4.6
  • Definition 1: ShapiroDeRu14
  • Lemma 3: ShapiroDeRu14
  • Theorem 5
  • Proposition 6
  • ...and 20 more