Learning Models with Uniform Performance via Distributionally Robust Optimization

John Duchi, Hongseok Namkoong

TL;DR

The paper develops a convex distributionally robust optimization framework to achieve uniform performance under distributional shifts and latent subpopulations. By formulating robustness via f-divergence balls around the nominal distribution, it links worst-case risk to tail performance and derives a tractable plug-in estimator with a dual formulation. The authors provide finite-sample convergence guarantees, minimax lower bounds, and asymptotic normality results, clarifying the statistical costs and tradeoffs of robustness. Empirical studies across domain adaptation, tail performance, and fine-grained subpopulations demonstrate improved tail and subpopulation performance at a controlled cost to average performance, with practical heuristics for choosing the robustness parameters. The work offers a principled, scalable approach to robust learning with theoretical guarantees and broad applicability to safety- and fairness-critical tasks.

Abstract

A common goal in statistics and machine learning is to learn models that can perform well against distributional shifts, such as latent heterogeneous subpopulations, unknown covariate shifts, or unmodeled temporal effects. We develop and analyze a distributionally robust stochastic optimization (DRO) framework that learns a model providing good performance against perturbations to the data-generating distribution. We give a convex formulation for the problem, providing several convergence guarantees. We prove finite-sample minimax upper and lower bounds, showing that distributional robustness sometimes comes at a cost in convergence rates. We give limit theorems for the learned parameters, where we fully specify the limiting distribution so that confidence intervals can be computed. On real tasks including generalizing to unknown subpopulations, fine-grained recognition, and providing good tail performance, the distributionally robust approach often exhibits improved performance.

Paper Structure

This paper contains 51 sections, 34 theorems, 229 equations, 6 figures, 1 table.

Key Result

Proposition 1

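Let $P$ be a probability measure on $(\mathcal{X}, \mathcal{A})$ and $\rho > 0$. Then $$\sup_{Q \ll P} \left\{ \mathbb{E}_Q[\ell(\theta; X)] : D_f(Q \| P) \le \rho \right\} = \inf_{\lambda \ge 0,\, \eta \in \mathbb{R}} \left\{ \mathbb{E}_P\left[ \lambda f^*\left( \frac{\ell(\theta; X) - \eta}{\lambda} \right) \right] + \lambda \rho + \eta \right\}$$ for all $\theta$, where $\ell$ denotes the loss, $D_f$ the $f$-divergence, and $f^*$ the convex conjugate of $f$. Moreover, if the supremum on the left hand side is finite, there are finite $\lambda(\theta) \ge 0$ and $\eta(\theta) \in \mathbb{R}$ attaining the infimum on the right hand side.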
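
A dual of this kind is what makes the worst-case risk computable from a sample. The sketch below is an illustration under stated assumptions, not the authors' implementation: it fixes the $\chi^2$ divergence $f(t) = \tfrac{1}{2}(t-1)^2$, for which the infimum is commonly reduced to the one-dimensional problem $\inf_\eta \{ \sqrt{1 + 2\rho}\, (\mathbb{E}_P[(\ell - \eta)_+^2])^{1/2} + \eta \}$, and it evaluates that expression on the empirical distribution of a list of losses (a plug-in estimate). The function name `chi_square_worst_case_risk`, the choice of divergence, and the ternary search are all assumptions made for this example.

```python
import math


def chi_square_worst_case_risk(losses, rho, iters=200):
    """Plug-in worst-case mean loss over a chi-square ball of radius rho.

    Hypothetical helper: minimizes the one-dimensional dual
        g(eta) = sqrt(1 + 2*rho) * sqrt(mean((loss - eta)_+^2)) + eta
    over eta, using the empirical distribution of `losses`.
    """
    n = len(losses)
    mean = sum(losses) / n
    if rho <= 0:
        # A ball of radius zero contains only the empirical distribution.
        return mean

    var = sum((x - mean) ** 2 for x in losses) / n
    c = math.sqrt(1.0 + 2.0 * rho)

    def dual_objective(eta):
        second_moment = sum(max(x - eta, 0.0) ** 2 for x in losses) / n
        return c * math.sqrt(second_moment) + eta

    # g is convex in eta. Above max(losses) it equals eta (increasing), and
    # below min(losses) its stationary point is mean - sqrt(var / (2*rho)),
    # so the minimizer lies in the bracket [lo, hi].
    lo = min(min(losses), mean - math.sqrt(var / (2.0 * rho)))
    hi = max(losses)

    # Ternary search on a convex function of one variable.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual_objective(m1) <= dual_objective(m2):
            hi = m2
        else:
            lo = m1
    return dual_objective(0.5 * (lo + hi))


if __name__ == "__main__":
    sample_losses = [0.0, 1.0]
    for radius in (0.0, 0.125, 10.0):
        print(radius, chi_square_worst_case_risk(sample_losses, radius))
```

For two equally likely losses 0 and 1 and $\rho = 0.125$, the worst-case distribution in the $\chi^2$ ball puts mass 0.75 on the larger loss, so the function returns approximately 0.75. With $\rho = 0$ it returns the ordinary mean, and for large $\rho$ it saturates at the maximum loss.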

Figures (6)

  • Figure 1: (a) Hinge losses (average and 90th percentile in solid and dashed lines, respectively) under distributional shifts from $\theta_0^\star$ to $\theta_t^\star = \theta_0^\star \cdot \cos t + v \cdot \sin t$. The horizontal axis indexes the perturbation $t$. (b) Losses on the minority group (solid line) and majority group (dotted line) under the distribution \ref{eqn:thresh}. We define the minority group as those with $X^1 \le z_{.95}$.
  • Figure 2: Two groups: Figures (a) and (b) plot average and minority group losses under the distribution \ref{eqn:first-scenario}. "YSplit" is the performance of the model whose $\rho$ and $k$ were chosen based on groups formed by sorted values of $Y$.
  • Figure 3: Infinite groups: Figures (a) and (b) plot average and minority group losses under the distribution \ref{eqn:infinite-scenario}. "YSplit" is the performance of the model whose $\rho$ and $k$ were chosen based on groups formed by sorted values of $Y$, and "$G = .5$" chose $k$ and $\rho$ based on auxiliary data with intervention $G = 0.5$.
  • Figure 4: (a) Test error on the hand-written digits (MNIST test dataset). (b)--(d) Test errors on type-written digits. Models were trained on data consisting of MNIST hand-written digits with 0--10% replaced by type-written digits. The horizontal axis of each plot denotes the percentage of type-written digits (relative to hand-written) in training. Each of the six lines represents a different value of $\rho$ used in training, where $\rho = 0$ corresponds to empirical risk minimization (ERM). (b) Classification error on the entire test set of type-written digits. (c) Classification error on digit 3 of the type-written digits. (d) Classification error on digit 9 of the type-written digits.
  • Figure 5: Median and maximal loss $|Y - Z^\top \theta|$ evaluated on training and test datasets. Values on the $x$-axis correspond to different indices for the values of $\rho$ and $r$, so that "$x$-axis = 1" for the $\ell_1$-constrained problem corresponds to $r = 5$, and for the distributionally robust method \ref{eqn:plug-in} it corresponds to $\rho = .001$. Error bars correspond to standard error.
  • ...and 1 more figure

Theorems & Definitions (37)

  • Proposition 1: Shapiro17
  • Lemma 1
  • Theorem 2
  • Corollary 1
  • Corollary 2
  • Theorem 3
  • Proposition 4
  • Proposition 5
  • Theorem 6
  • Proposition 7
  • ...and 27 more