Distributionally Robust Losses for Latent Covariate Mixtures

John Duchi, Tatsunori Hashimoto, Hongseok Namkoong

TL;DR

The paper tackles the problem of achieving uniformly good performance across latent subpopulations when data come from a mixture of covariate distributions. It introduces marginal distributionally robust optimization (DRO) over latent subpopulations, leveraging a dual CVaR representation of worst-case risk and a scalable $L_p$ Hölder variational bound to obtain tractable, nonparametric guarantees. The authors provide finite-sample bounds, convergence rates tied to the Wasserstein distance, and hardness results that illuminate dimension-driven limits. Empirically, the marginal DRO approach yields improved worst-case performance on tasks including semantic similarity, wine quality prediction, and recidivism prediction, while highlighting the computational and dimensional trade-offs. Overall, the work offers a rigorous, scalable framework for robust subpopulation performance that bridges covariate shift, fairness, and causal-inference perspectives.

Abstract

While modern large-scale datasets often consist of heterogeneous subpopulations -- for example, multiple demographic groups or multiple text corpora -- the standard practice of minimizing average loss fails to guarantee uniformly low losses across all subpopulations. We propose a convex procedure that controls the worst-case performance over all subpopulations of a given size. Our procedure comes with finite-sample (nonparametric) convergence guarantees on the worst-off subpopulation. Empirically, we observe on lexical similarity, wine quality, and recidivism prediction tasks that our worst-case procedure learns models that do well against unseen subpopulations.

Paper Structure

This paper contains 59 sections, 19 theorems, 190 equations, 11 figures, 1 table.

Key Result

Lemma 2.1

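If $\mathbb{E}[| \mathbb{E}[\ell(\theta; (X,Y)) \mid X]|] < \infty$, then the worst-case risk over latent subpopulations admits a dual (CVaR-type) representation as an infimum over a scalar $\eta$. If additionally $0 \le \mathbb{E}[\ell(\theta; (X,Y)) \mid X] \le M$ with probability 1, the infimizing $\eta$ lies in $[0, M]$.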
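To make the role of $\eta$ concrete, the following is a minimal sketch of the standard (Rockafellar-Uryasev) dual form of CVaR on a finite sample, $\inf_{\eta} \{\eta + \tfrac{1}{\alpha}\,\mathbb{E}[(Z - \eta)_+]\}$. It is an illustration under stated assumptions, not the paper's estimator: the array `z` stands in for values of the conditional risk $\mathbb{E}[\ell(\theta; (X,Y)) \mid X]$, which in practice is not observed directly, and the function names `cvar_dual` and `cvar_primal` are hypothetical.

```python
import numpy as np


def cvar_dual(z, alpha):
    """Empirical CVaR via the dual form: min over eta of eta + mean((z - eta)_+) / alpha.

    The objective is convex and piecewise linear in eta with kinks at the
    sample values, so it suffices to evaluate it at those values.
    Returns (minimum value, a minimizing eta).
    """
    z = np.asarray(z, dtype=float)
    etas = np.unique(z)  # candidate minimizers (sorted sample values)
    # Row i holds (z - etas[i])_+ for every sample; average over samples.
    excess = np.maximum(z[None, :] - etas[:, None], 0.0).mean(axis=1)
    values = etas + excess / alpha
    best = int(np.argmin(values))
    return float(values[best]), float(etas[best])


def cvar_primal(z, alpha):
    """Mean of the worst alpha-fraction of z (assumes alpha * len(z) is an integer)."""
    z = np.sort(np.asarray(z, dtype=float))[::-1]
    k = int(round(alpha * len(z)))
    return float(z[:k].mean())


if __name__ == "__main__":
    # Assumed stand-in for conditional risks bounded in [0, M] with M = 1.
    rng = np.random.default_rng(0)
    z = rng.uniform(0.0, 1.0, size=1000)
    value, eta = cvar_dual(z, alpha=0.1)
    print(value, eta, cvar_primal(z, alpha=0.1))
```

On a sample where $\alpha n$ is an integer, the dual value agrees with the average of the worst $\alpha$-fraction of the values, and the minimizing $\eta$ is a sample value, so it stays within the range of the data, in line with the lemma's statement that the infimizing $\eta$ lies in $[0, M]$ when the conditional risk is bounded.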

Figures (11)

  • Figure 1: Toy problem of $L_1$ regression through origin.
  • Figure 2: Dimension and sample size dependence of robust loss surrogates. The two marginal DRO methods correspond to different choices in the variational approximation ($L_p$ Hölder and Bounded Hölder).
  • Figure 3: Sensitivity of marginal DRO losses to test-time worst-case group size (left) and Lipschitz constant estimate (right).
  • Figure 4: Semantic similarity prediction task, with worst-case prediction error $R_{\alpha_0}(\theta)$ (Eq. \ref{eq:semeval}) over subgroups (y-axis) evaluated over varying test-time worst-case group sizes $\alpha_0$ (x-axis).
  • Figure 5: Marginal DRO improves worst-case loss $R_{\alpha_0, \text{joint}}(\theta)$ for the wine quality prediction task under a real-world red-to-white wine distribution shift. The gain holds over a wide range of Lipschitz constants, from $L/\epsilon=0.1$ (left) to $300$ (right).
  • ...and 6 more figures

Theorems & Definitions (21)

  • Lemma 2.1
  • Proposition 1
  • Lemma 3.1
  • Lemma 3.2
  • Lemma 4.1
  • Lemma 4.2
  • Theorem 1
  • Proposition 2
  • Theorem 2
  • Corollary 1
  • ...and 11 more