Taking a Moment for Distributional Robustness

Jabari Hastings, Christopher Jung, Charlotte Peale, Vasilis Syrgkanis

TL;DR

This work reframes distributionally robust learning from minimizing the worst-case loss across distributions to minimizing the worst-case distance to the true conditional expectation $h_0(X) = \mathbb{E}_D[Y|X]$. It introduces an adversarial moment violation objective $\min_{h\in\mathcal{H}} \max_{D\in\mathcal{D}} \max_{f\in\mathcal{F}} \mathbb{E}_{(X,Y)\sim D}[2\,(Y-h(X))f(X) - c f(X)^2]$ that, with a sufficiently rich $\mathcal{F}$, is equivalent to optimizing the worst-case $\ell_2$ distance to $h_0$ across distributions, and relates to minimax regret in square loss. The paper provides finite-sample guarantees, scalable computation across convex, linear, and RKHS settings, and empirical validation on synthetic RKHS data and the CelebA dataset, showing competitive worst-group performance with reduced runtime as the number of distributions grows. This approach advances scalable, noise-oblivious DRO by enabling efficient handling of many subpopulations and offers practical gains for fairness-aware learning in large-scale settings.
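The claimed equivalence can be checked numerically in a simple case. The sketch below is illustrative only and is not the paper's implementation: it assumes a finite covariate space, takes $\mathcal{F}$ to be all functions on that space, and fixes $c = 1$. The names `moment_objective` and `best_response` and the synthetic data are hypothetical. Under these assumptions the maximizing $f$ is $(\bar{y}(x) - h(x))/c$, where $\bar{y}(x)$ is the empirical conditional mean, and the maximized objective equals $\frac{1}{c}$ times the empirical squared distance between $h$ and $\bar{y}$.

```python
import numpy as np

def moment_objective(x, y, h, f, c=1.0):
    """Empirical E[2 (Y - h(X)) f(X) - c f(X)^2] for tabular h and f indexed by x."""
    return np.mean(2.0 * (y - h[x]) * f[x] - c * f[x] ** 2)

def best_response(x, y, h, n_values, c=1.0):
    """Maximizer over ALL functions on {0, ..., n_values - 1} (assumes c > 0)."""
    counts = np.bincount(x, minlength=n_values)
    sums = np.bincount(x, weights=y, minlength=n_values)
    # Empirical conditional mean of Y given X = v (0 where v is never observed).
    cond_mean = np.divide(sums, counts, out=np.zeros(n_values), where=counts > 0)
    return (cond_mean - h) / c, cond_mean, counts

# Hypothetical synthetic data: 5 covariate values, heavy label noise.
rng = np.random.default_rng(0)
n_values, n, c = 5, 2000, 1.0
x = rng.integers(0, n_values, size=n)
h0 = np.linspace(-1.0, 1.0, n_values)          # true conditional expectation
y = h0[x] + rng.normal(scale=2.0, size=n)      # noisy labels
h = np.zeros(n_values)                         # candidate hypothesis

f_star, cond_mean, counts = best_response(x, y, h, n_values, c)
value = moment_objective(x, y, h, f_star, c)
# Closed form: (1/c) * weighted squared distance between h and the conditional mean.
closed_form = np.sum(counts / n * (cond_mean - h) ** 2) / c
print(value, closed_form)
```

The two printed numbers agree up to floating-point error, and the label-noise variance never appears in the closed form, which is the sense in which the objective is noise-oblivious.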

Abstract

A rich line of recent work has studied distributionally robust learning approaches that seek to learn a hypothesis that performs well, in the worst case, on many different distributions over a population. We argue that although the most common approaches seek to minimize the worst-case loss over distributions, a more reasonable goal is to minimize the worst-case distance to the true conditional expectation of labels given each covariate. Focusing on the min-max loss objective can dramatically fail to output a solution minimizing the distance to the true conditional expectation when certain distributions contain high levels of label noise. We introduce a new min-max objective based on what is known as the adversarial moment violation and show that minimizing this objective is equivalent to minimizing the worst-case $\ell_2$-distance to the true conditional expectation if we take the adversary's strategy space to be sufficiently rich. Previous work has suggested minimizing the maximum regret over the worst-case distribution as a way to circumvent issues arising from differential noise levels. We show that in the case of square loss, minimizing the worst-case regret is also equivalent to minimizing the worst-case $\ell_2$-distance to the true conditional expectation. Although their objective and our objective both minimize the worst-case distance to the true conditional expectation, we show that our approach provides large empirical savings in computational cost in terms of the number of groups, while providing the same noise-oblivious worst-distribution guarantee as the minimax regret approach, thus making positive progress on an open question posed by Agarwal and Zhang (2022).
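The label-noise failure mode can be seen in a toy population-level calculation. The sketch below is an assumption-laden illustration, not an experiment from the paper: it uses two hypothetical groups with constant true means 0 and 1, noise variances 0 and 4, and restricts hypotheses to constants $h$ on a grid. Under square loss, the expected loss in a group is the squared distance to its mean plus its noise variance, and the regret is the loss minus the best achievable loss in that group.

```python
import numpy as np

# Hypothetical two-group setup: constant true means, very different noise levels.
means = np.array([0.0, 1.0])        # true conditional expectation per group
noise_var = np.array([0.0, 4.0])    # label-noise variance per group
grid = np.linspace(-1.0, 2.0, 3001) # candidate constant hypotheses, step 0.001

# Population quantities for every (hypothesis, group) pair.
dist = (grid[:, None] - means[None, :]) ** 2   # squared distance to the true mean
loss = dist + noise_var[None, :]               # expected square loss
regret = loss - noise_var[None, :]             # loss minus best-in-group loss

h_dro = grid[np.argmin(loss.max(axis=1))]      # minimizes worst-case loss
h_dist = grid[np.argmin(dist.max(axis=1))]     # minimizes worst-case distance
h_mro = grid[np.argmin(regret.max(axis=1))]    # minimizes worst-case regret
print(h_dro, h_dist, h_mro)
```

In this toy setting the worst-case-loss minimizer sits on the noisy group's mean, while the worst-case-distance and worst-case-regret minimizers coincide at the midpoint between the two means.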

Paper Structure

This paper contains 34 sections, 24 theorems, 133 equations, 4 figures, 2 tables, 4 algorithms.

Key Result

Lemma 2.2

Fix any number $c \neq 0$. Consider the maximum violation criterion: Then we have:
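The displayed equations of the lemma are not reproduced here. As a hedged sketch of what completing the square yields for the objective stated in the TL;DR (not necessarily the paper's exact statement), write $\Delta(X) = h_0(X) - h(X)$. Since $f$ depends only on $X$, the tower property replaces $Y$ by $h_0(X)$, and for any $c \neq 0$:

$$
\mathbb{E}\big[2\,(Y - h(X))\,f(X) - c\,f(X)^2\big]
= \mathbb{E}\big[2\,\Delta(X)\,f(X) - c\,f(X)^2\big]
= \frac{1}{c}\,\mathbb{E}\big[\Delta(X)^2\big] - c\,\mathbb{E}\Big[\big(f(X) - \tfrac{1}{c}\,\Delta(X)\big)^2\Big].
$$

Under the additional assumptions that $c > 0$ and that $\mathcal{F}$ contains $\Delta / c$, the maximum over $f \in \mathcal{F}$ is attained at $f = \Delta / c$ and equals $\frac{1}{c}\,\mathbb{E}\big[(h_0(X) - h(X))^2\big]$, which is the $\ell_2$ distance to $h_0$ referred to in the abstract, scaled by $1/c$.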

Figures (4)

  • Figure 1: Performance of MRO, DRO, and our method (Adv-moment) on synthetic data generated with 50 groups, each of size 100. We see that DRO favors the high-noise group, while MRO and Adv-moment choose the regret-minimizing parabola halfway between the groups' true functions.
  • Figure 2: Runtime of MRO, DRO, and our method (Adv-moment) on synthetic data generated with increasing numbers of groups. Error bars represent standard error. We see that MRO suffers increasing slowdowns as the number of groups increases.
  • Figure 3: Example of synthetic data distribution for 10 groups, each of size 100. Each group is color-coded for easy differentiation. Every group follows a parabolic function with additive Gaussian noise.
  • Figure 4: Runtime of our method (Adv-moment) and baselines (groupDRO (DRO), MRO) on a synthetic dataset of size 2000 as the number of groups increases. We see that when there are only two groups, each of size 1000, MRO runs faster than the adversarial moment method, but this advantage shrinks as the number of groups increases.

Theorems & Definitions (41)

  • Lemma 2.2: Completing the Square
  • proof
  • Corollary 2.3: Upper Bound
  • Corollary 2.4: Lower Bound
  • Remark 2.5
  • Theorem 2.6: Population Limit, MSE
  • Definition 2.7: Multiaccuracy Error
  • Theorem 3.1: MSE guarantee, Finite Samples Regularized
  • Remark 3.2: Advantages of Regularization
  • Remark 3.3
  • ...and 31 more