Taking a Moment for Distributional Robustness
Jabari Hastings, Christopher Jung, Charlotte Peale, Vasilis Syrgkanis
TL;DR
This work reframes distributionally robust learning from minimizing the worst-case loss across distributions to minimizing the worst-case distance to the true conditional expectation $h_0(X) = \mathbb{E}_D[Y|X]$. It introduces an adversarial moment violation objective $\min_{h\in\mathcal{H}} \max_{D\in\mathcal{D}} \max_{f\in\mathcal{F}} \mathbb{E}_{(X,Y)\sim D}[2\,(Y-h(X))f(X) - c f(X)^2]$ that, with a sufficiently rich $\mathcal{F}$, is equivalent to optimizing the worst-case $\ell_2$ distance to $h_0$ across distributions, and relates this objective to minimax regret under square loss. The paper provides finite-sample guarantees, scalable computation in convex, linear, and RKHS settings, and empirical validation on synthetic RKHS data and the CelebA dataset, showing competitive worst-group performance with reduced runtime as the number of distributions grows. The approach offers a scalable, noise-oblivious form of distributionally robust optimization (DRO) that handles many subpopulations efficiently, which is relevant to fairness-aware learning in large-scale settings.
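The claimed equivalence can be checked with a short calculation. The sketch below is not taken from the paper; it assumes $c > 0$, that $h_0$ is the conditional expectation under every $D \in \mathcal{D}$, and that $\mathcal{F}$ contains the pointwise maximizer $f^*$. For a fixed $h$ and $D$, the tower property replaces $Y$ by $h_0(X)$:

$$
\mathbb{E}_{(X,Y)\sim D}\big[2\,(Y-h(X))f(X) - c f(X)^2\big] = \mathbb{E}_{(X,Y)\sim D}\big[2\,(h_0(X)-h(X))f(X) - c f(X)^2\big].
$$

The integrand is a concave quadratic in $f(X)$, maximized pointwise at $f^*(X) = (h_0(X)-h(X))/c$, so

$$
\max_{f\in\mathcal{F}} \mathbb{E}_{(X,Y)\sim D}\big[2\,(Y-h(X))f(X) - c f(X)^2\big] = \frac{1}{c}\,\mathbb{E}_{(X,Y)\sim D}\big[(h_0(X)-h(X))^2\big].
$$

Under these assumptions, the outer $\min_{h}\max_{D}$ therefore minimizes the worst-case squared $\ell_2$ distance to $h_0$, scaled by $1/c$, and the label noise $Y - h_0(X)$ drops out of the objective.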
Abstract
A rich line of recent work has studied distributionally robust learning approaches that seek to learn a hypothesis that performs well, in the worst case, on many different distributions over a population. We argue that although the most common approaches seek to minimize the worst-case loss over distributions, a more reasonable goal is to minimize the worst-case distance to the true conditional expectation of labels given each covariate. Focusing on the min-max loss objective can dramatically fail to output a solution minimizing the distance to the true conditional expectation when certain distributions contain high levels of label noise. We introduce a new min-max objective based on what is known as the adversarial moment violation and show that minimizing this objective is equivalent to minimizing the worst-case $\ell_2$-distance to the true conditional expectation if we take the adversary's strategy space to be sufficiently rich. Previous work has suggested minimizing the maximum regret over the worst-case distribution as a way to circumvent issues arising from differential noise levels. We show that in the case of square loss, minimizing the worst-case regret is also equivalent to minimizing the worst-case $\ell_2$-distance to the true conditional expectation. Although the minimax regret objective and our objective both minimize the worst-case distance to the true conditional expectation, we show that our approach provides large empirical savings in computational cost as the number of groups grows, while providing the same noise-oblivious worst-distribution guarantee as the minimax regret approach, thus making progress on an open question posed by Agarwal and Zhang (2022).
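To make the label-noise failure mode concrete, here is a minimal sketch on a hypothetical toy problem that is not from the paper. It assumes two groups, each concentrated on a single covariate value, a hypothesis class of constant predictors $h(x) = \theta$, and population quantities computed in closed form rather than from samples. The group names, noise levels, and grid are illustrative choices.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): h_0(0) = 0 and h_0(1) = 1.
# Group A puts all mass on X = 0 and has label-noise variance 1.0.
# Group B puts all mass on X = 1 and has no label noise.
H0 = {"A": 0.0, "B": 1.0}
NOISE_VAR = {"A": 1.0, "B": 0.0}


def sq_distance(theta, group):
    """Population squared distance between the constant predictor theta and h_0."""
    return (theta - H0[group]) ** 2


def sq_loss(theta, group):
    """Population square loss: squared distance plus irreducible noise variance."""
    return sq_distance(theta, group) + NOISE_VAR[group]


def minimax(objective, grid):
    """Return the grid value that minimizes the worst-group objective."""
    worst = np.maximum(objective(grid, "A"), objective(grid, "B"))
    return float(grid[np.argmin(worst)])


def worst_distance(theta):
    """Worst-group squared distance to h_0 for a given constant predictor."""
    return max(sq_distance(theta, "A"), sq_distance(theta, "B"))


grid = np.linspace(-1.0, 2.0, 301)
theta_loss = minimax(sq_loss, grid)      # minimizes worst-case square loss
theta_dist = minimax(sq_distance, grid)  # minimizes worst-case distance to h_0

print("min-max loss solution:", theta_loss, "worst distance:", worst_distance(theta_loss))
print("min-max distance solution:", theta_dist, "worst distance:", worst_distance(theta_dist))
```

In this toy problem the min-max loss solution sits at $\theta = 0$, because the noisy group A dominates the worst-case loss, and its worst-group squared distance to $h_0$ is 1. The min-max distance solution sits at $\theta = 0.5$ with worst-group squared distance 0.25. Subtracting each group's noise variance from its loss, which is what regret does here, recovers the distance objective.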
