Taking a Moment for Distributional Robustness

Jabari Hastings; Christopher Jung; Charlotte Peale; Vasilis Syrgkanis

TL;DR

This work reframes distributionally robust learning from minimizing the worst-case loss across distributions to minimizing the worst-case distance to the true conditional expectation $h_0(X) = \mathbb{E}_D[Y|X]$. It introduces an adversarial moment violation objective $\min_{h\in\mathcal{H}} \max_{D\in\mathcal{D}} \max_{f\in\mathcal{F}} \mathbb{E}_{(X,Y)\sim D}[2\,(Y-h(X))f(X) - c f(X)^2]$ that, with a sufficiently rich $\mathcal{F}$, is equivalent to optimizing the worst-case $\ell_2$ distance to $h_0$ across distributions, and relates to minimax regret in square loss. The paper provides finite-sample guarantees, scalable computation across convex, linear, and RKHS settings, and empirical validation on synthetic RKHS data and the CelebA dataset, showing competitive worst-group performance with reduced runtime as the number of distributions grows. This approach advances scalable, noise-oblivious DRO by enabling efficient handling of many subpopulations and offers practical gains for fairness-aware learning in large-scale settings.
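To make the inner maximization concrete, here is a minimal NumPy sketch under assumptions that are ours rather than the paper's: $c > 0$, a linear test class $f(x) = \theta^\top \phi(x)$, and a small ridge term for numerical stability. The names `moment_violation`, `worst_group_violation`, `featurize`, and the demo data are hypothetical. For this class the inner max has the closed form $\frac{1}{c}\, b^\top \Sigma^{-1} b$ with $b = \mathbb{E}[(Y - h(X))\phi(X)]$ and $\Sigma = \mathbb{E}[\phi(X)\phi(X)^\top]$.

```python
import numpy as np

def moment_violation(resid, feats, c=1.0, ridge=1e-8):
    """Maximize mean(2*resid*f - c*f**2) over linear tests f = feats @ theta.

    Assumes c > 0. A ridge penalty c*ridge*||theta||^2 is subtracted for
    numerical stability (an assumption of this sketch).
    Returns (maximal value, maximizing theta).
    """
    n, d = feats.shape
    b = feats.T @ resid / n                          # empirical E[(y - h(x)) phi(x)]
    sigma = feats.T @ feats / n + ridge * np.eye(d)  # empirical E[phi phi^T] + ridge
    sol = np.linalg.solve(sigma, b)
    return float(b @ sol) / c, sol / c

def worst_group_violation(h, groups, featurize, c=1.0):
    """Outer max over groups; each group is a pair (x, y) of 1-D arrays."""
    return max(moment_violation(y - h(x), featurize(x), c)[0] for x, y in groups)

def featurize(x):
    # Hypothetical feature map: quadratic polynomial features.
    return np.stack([np.ones_like(x), x, x**2], axis=1)

# Hypothetical demo: two groups share the truth x**2 but differ in label noise.
rng = np.random.default_rng(0)

def make_group(noise_sd, n=20000):
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, x**2 + noise_sd * rng.normal(size=n)

groups = [make_group(0.1), make_group(2.0)]

def h_true(x):
    return x**2

def h_shifted(x):
    return x**2 + 0.5

# Worst-group square loss is dominated by the noisy group even at the truth,
# while the worst-group moment violation stays close to zero there.
worst_mse_true = max(np.mean((y - h_true(x)) ** 2) for x, y in groups)
viol_true = worst_group_violation(h_true, groups, featurize)
viol_shifted = worst_group_violation(h_shifted, groups, featurize)
```

In this toy setup the violation of `h_true` is small in both groups despite the very different noise levels, whereas the shifted hypothesis incurs a violation near its squared offset divided by `c`. Richer test classes would replace the closed form with an iterative adversary.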

Abstract

A rich line of recent work has studied distributionally robust learning approaches that seek to learn a hypothesis that performs well, in the worst case, on many different distributions over a population. We argue that although the most common approaches seek to minimize the worst-case loss over distributions, a more reasonable goal is to minimize the worst-case distance to the true conditional expectation of labels given each covariate. Focusing on the min-max loss objective can dramatically fail to output a solution minimizing the distance to the true conditional expectation when certain distributions contain high levels of label noise. We introduce a new min-max objective based on what is known as the adversarial moment violation and show that minimizing this objective is equivalent to minimizing the worst-case $\ell_2$-distance to the true conditional expectation if we take the adversary's strategy space to be sufficiently rich. Previous work has suggested minimizing the maximum regret over the worst-case distribution as a way to circumvent issues arising from differential noise levels. We show that in the case of square loss, minimizing the worst-case regret is also equivalent to minimizing the worst-case $\ell_2$-distance to the true conditional expectation. Although their objective and ours both minimize the worst-case distance to the true conditional expectation, we show that our approach provides large empirical savings in computational cost as the number of groups grows, while providing the same noise-oblivious worst-distribution guarantee as the minimax regret approach, thus making positive progress on an open question posed by Agarwal and Zhang (2022).
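The role of label noise can be made explicit with the standard decomposition of the square loss. The notation $h_D(X) = \mathbb{E}_D[Y \mid X]$ and the choice to measure regret against $h_D$ itself are assumptions of this sketch; a regret defined against the best hypothesis in a restricted class coincides with it only when $h_D$ belongs to that class.

$$
\begin{aligned}
\mathbb{E}_D\big[(Y - h(X))^2\big] &= \mathbb{E}_D\big[(h_D(X) - h(X))^2\big] + \mathbb{E}_D\big[\mathrm{Var}_D(Y \mid X)\big], \\
\mathbb{E}_D\big[(Y - h(X))^2\big] - \mathbb{E}_D\big[(Y - h_D(X))^2\big] &= \mathbb{E}_D\big[(h_D(X) - h(X))^2\big].
\end{aligned}
$$

The variance term does not depend on $h$, yet it can decide which distribution attains the maximum in a worst-case loss objective. Subtracting it, as regret does, leaves only the squared $\ell_2$ distance to the conditional expectation.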

Paper Structure (34 sections, 24 theorems, 133 equations, 4 figures, 2 tables, 4 algorithms)

Introduction
Related Work
Distributionally Robust Optimization
Fair Solutions via Worst-Case Moment Violation Minimization
Finite Sample Analysis
Sample Complexity for Linear Function Classes
Efficient Computation
Convex Spaces
Linear Spaces
Reproducing Kernel Hilbert Spaces
Experiments
Synthetic Data and RKHS
CelebA and Neural Network
Discussion
Broader Impact
...and 19 more sections

Key Result

Lemma 2.2

Fix any number $c \neq 0$. Consider the maximum violation criterion: Then we have:
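As a hedged sketch of the algebra that the lemma's title, Completing the Square, points to, consider the inner objective from the TL;DR for a fixed distribution $D$. The notation $h_D(X) = \mathbb{E}_D[Y \mid X]$ and the assumption that $f$ may range over all square-integrable functions are ours, and this is not a reproduction of the paper's exact statement. The first step uses the tower property, and the second holds for any $c \neq 0$:

$$
\begin{aligned}
\mathbb{E}_D\big[2\,(Y - h(X))\,f(X) - c\,f(X)^2\big]
&= \mathbb{E}_D\big[2\,(h_D(X) - h(X))\,f(X) - c\,f(X)^2\big] \\
&= \frac{1}{c}\,\mathbb{E}_D\big[(h_D(X) - h(X))^2\big] - c\,\mathbb{E}_D\Big[\Big(f(X) - \tfrac{1}{c}\,(h_D(X) - h(X))\Big)^2\Big].
\end{aligned}
$$

For $c > 0$ the second term is non-positive and vanishes at $f = (h_D - h)/c$, so whenever that function lies in $\mathcal{F}$ the maximum over $f$ equals $\frac{1}{c}\,\mathbb{E}_D[(h_D(X) - h(X))^2]$, the squared $\ell_2$ distance scaled by $1/c$.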

Figures (4)

Figure 1: Performance of MRO, DRO, and our method (Adv-moment) on synthetic data generated with 50 groups, each of size 100. DRO favors the high-noise group, while MRO and Adv-moment choose the regret-minimizing parabola halfway between the groups' true functions.
Figure 2: Runtime of MRO, DRO, and our method (Adv-moment) on synthetic data generated with an increasing number of groups. Error bars represent standard error. MRO slows down progressively as the number of groups increases.
Figure 3: Example of the synthetic data distribution for 10 groups, each of size 100. Groups are color-coded. Every group follows a parabolic function with additive Gaussian noise.
Figure 4: Runtime of our method (Adv-moment) and baselines (groupDRO (DRO), MRO) on a synthetic dataset of size 2000 as the number of groups increases. With only two groups, each of size 1000, MRO runs faster than the adversarial moment method, but this advantage shrinks as the number of groups increases.
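The synthetic setup described in the captions (groups that each follow a parabola with additive Gaussian noise) could be generated along the following lines. This is only an illustrative sketch: the function name `make_parabola_groups`, the coefficient distribution, the input range, and the default noise scale are assumptions and are not taken from the paper.

```python
import numpy as np

def make_parabola_groups(n_groups=10, group_size=100, noise_scales=None, seed=0):
    """Return a list of (x, y, coeffs) triples, one per group.

    Each group draws hypothetical parabola coefficients (a, b, c) and labels
    y = a*x**2 + b*x + c + noise, with a per-group Gaussian noise scale.
    """
    rng = np.random.default_rng(seed)
    if noise_scales is None:
        noise_scales = np.full(n_groups, 0.1)  # assumed default noise level
    groups = []
    for g in range(n_groups):
        a, b, c = rng.normal(size=3)                      # assumed coefficient law
        x = rng.uniform(-1.0, 1.0, size=group_size)       # assumed input range
        noise = noise_scales[g] * rng.normal(size=group_size)
        y = a * x**2 + b * x + c + noise
        groups.append((x, y, (a, b, c)))
    return groups

# Example: 10 groups of 100 points, with one deliberately noisy group.
scales = np.full(10, 0.1)
scales[0] = 2.0
demo_groups = make_parabola_groups(n_groups=10, group_size=100, noise_scales=scales)
```

Raising a single entry of `noise_scales` mimics the high-noise group mentioned in the Figure 1 caption, which is the regime where a worst-case loss objective and a worst-case distance objective are expected to disagree.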

Theorems & Definitions (41)

Lemma 2.2: Completing the Square
proof
Corollary 2.3: Upper Bound
Corollary 2.4: Lower Bound
Remark 2.5
Theorem 2.6: Population Limit, MSE
Definition 2.7: Multiaccuracy Error
Theorem 3.1: MSE guarantee, Finite Samples Regularized
Remark 3.2: Advantages of Regularization
Remark 3.3
...and 31 more
