
On Generalization and Regularization via Wasserstein Distributionally Robust Optimization

Qinyu Wu, Jonathan Yu-Meng Li, Tiantian Mao

TL;DR

The paper addresses robust data-driven decision-making under distributional ambiguity by extending Wasserstein DRO to general type-$p$ balls and arbitrary risk criteria. It introduces a projection-set framework that connects high-dimensional joint distributions to tractable one-dimensional Wasserstein balls and max-sliced WDRO, yielding non-asymptotic, dimension-free generalization bounds for affine policies. It further develops a regularization perspective, proving exact equivalences between DRO and regularized empirical optimization for broad families of loss functions and distortion-based risk measures, with a regularization term that scales as $\varepsilon\|\boldsymbol{\beta}\|_*$ and, in many cases, is independent of the order $p$. The results extend to non-affine decision rules under mild tail assumptions and provide practical convex reformulations (e.g., CVaR-based SVM variants) that enable efficient computation. Overall, the work offers a universal theoretical foundation for applying Wasserstein DRO across a wide range of data-driven decision problems while guiding when simpler regularization suffices and when full DRO is necessary.
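The regularized form mentioned above, an empirical risk plus a penalty $\varepsilon\|\boldsymbol{\beta}\|_*$, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's formulation: the hinge loss is chosen only because the summary mentions SVM variants, the feature norm is assumed to be an $\ell_q$ norm (so its dual is $\ell_{q^*}$ with $1/q + 1/q^* = 1$), and the function names `dual_norm`, `hinge_loss`, and `regularized_objective` are hypothetical. The conditions under which this objective exactly equals the Wasserstein DRO objective are those stated in the paper's theorems and are not checked here.

```python
import math


def dual_norm(beta, q):
    """Dual of the l_q norm, i.e. the l_{q*} norm with 1/q + 1/q* = 1."""
    if q == 1:
        # Dual of l_1 is the sup norm.
        return max(abs(b) for b in beta)
    if q == math.inf:
        # Dual of the sup norm is l_1.
        return sum(abs(b) for b in beta)
    q_star = q / (q - 1.0)
    return sum(abs(b) ** q_star for b in beta) ** (1.0 / q_star)


def hinge_loss(margin):
    """Hinge loss as a function of the margin y * beta^T x (1-Lipschitz)."""
    return max(0.0, 1.0 - margin)


def regularized_objective(beta, xs, ys, eps, q=2):
    """Empirical hinge risk plus eps * ||beta||_* (illustrative assumption)."""
    n = len(xs)
    empirical_risk = sum(
        hinge_loss(y * sum(b * xi for b, xi in zip(beta, x)))
        for x, y in zip(xs, ys)
    ) / n
    return empirical_risk + eps * dual_norm(beta, q)


if __name__ == "__main__":
    beta = [3.0, -4.0]
    xs = [[1.0, 0.0], [0.0, 1.0]]
    ys = [1, -1]
    print(regularized_objective(beta, xs, ys, eps=0.1, q=2))
```

In this toy example both points are classified with margin above 1, so the empirical risk is zero and the objective reduces to the penalty term alone. Note that the penalty in this sketch does not depend on the Wasserstein order $p$, consistent with the summary's remark that the regularizer is, in many cases, independent of $p$.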

Abstract

Wasserstein distributionally robust optimization (DRO) has gained prominence in operations research and machine learning as a powerful method for achieving solutions with favorable out-of-sample performance. Two compelling explanations for its success are the generalization bounds derived from Wasserstein DRO and its equivalence to regularization schemes commonly used in machine learning. However, existing results on generalization bounds and regularization equivalence are largely limited to settings where the Wasserstein ball is of a specific type, and the decision criterion takes certain forms of expected functions. In this paper, we show that generalization bounds and regularization equivalence can be obtained in a significantly broader setting, where the Wasserstein ball is of a general type and the decision criterion accommodates any form, including general risk measures. This not only addresses important machine learning and operations management applications but also expands to general decision-theoretical frameworks previously unaddressed by Wasserstein DRO. Our results are strong in that the generalization bounds do not suffer from the curse of dimensionality and the equivalency to regularization is exact. As a by-product, we show that Wasserstein DRO coincides with the recent max-sliced Wasserstein DRO for any decision criterion under affine decision rules -- resulting in both being efficiently solvable as convex programs via our general regularization results. These general assurances provide a strong foundation for expanding the application of Wasserstein DRO across diverse domains of data-driven decision problems.

Paper Structure (19 sections, 29 theorems, 229 equations, 2 figures, 1 table)


Key Result

Proposition 1

Suppose that $p\in[1,\infty]$, $\varepsilon\geqslant 0$, $\boldsymbol{\beta}\in \mathbb{R}^n$, $F_0\in\mathcal{M}_p(\Xi)$ and $(Y_0,\mathbf X_0)\sim F_0$. Then, we have [equation not recovered from source] and [equation not recovered from source], where $\overline{\mathcal{B}}_{p|\boldsymbol{\beta}}$, $\mathcal{B}_p$, and $\mathcal{B}_p^{\rm ms}$ are defined by eq-HONE, eq-WU-multi-d, and eq-WU-multi-d2, respectively.
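
The appearance of the dual norm $\|\boldsymbol{\beta}\|_*$ in projection results of this kind can be motivated by a standard Hölder argument. The sketch below is offered for intuition only and is not the paper's statement or proof: it ignores the label $Y$ and considers only the projection $\boldsymbol{\beta}^\top \mathbf{X}$, it takes $p\in[1,\infty)$, and $W_p$ and the coupling $\pi$ are notation introduced here for the type-$p$ Wasserstein distance.

```latex
% Hölder's inequality for a norm \|\cdot\| on \mathbb{R}^n and its dual \|\cdot\|_*:
\[
  \bigl|\boldsymbol{\beta}^\top \mathbf{x} - \boldsymbol{\beta}^\top \mathbf{x}_0\bigr|
  \le \|\boldsymbol{\beta}\|_* \, \|\mathbf{x} - \mathbf{x}_0\| .
\]
% Take p-th moments under any coupling \pi of F and F_0,
% then the infimum over all such couplings:
\[
  W_p\bigl(F_{\boldsymbol{\beta}^\top \mathbf{X}},\, F_{\boldsymbol{\beta}^\top \mathbf{X}_0}\bigr)
  \le \|\boldsymbol{\beta}\|_* \, W_p(F, F_0)
  \le \varepsilon \, \|\boldsymbol{\beta}\|_*
  \quad \text{whenever } W_p(F, F_0) \le \varepsilon .
\]
```

This gives only one direction: every projected distribution lies in a one-dimensional ball of radius $\varepsilon\|\boldsymbol{\beta}\|_*$. Whether every element of that one-dimensional ball is attained by some joint distribution is a separate question that this bound does not address.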

Figures (2)

  • Figure 1: Wasserstein radius versus the number of training samples, with $\alpha=0.2$
  • Figure 2: Wasserstein radius versus the number of training samples for different $\alpha$ levels (Left: $\alpha=0.2$; Middle: $\alpha=0.5$; Right: $\alpha=0.8$).

Theorems & Definitions (52)

  • Proposition 1
  • Theorem 1
  • Theorem 2
  • Example 1: Expected function
  • Example 2: Risk measure
  • Theorem 3
  • Example 3
  • Theorem 4
  • Remark 1
  • Remark 2
  • ...and 42 more