Provable Robust Overfitting Mitigation in Wasserstein Distributionally Robust Optimization

Shuang Liu, Yihan Wang, Yifan Zhu, Yibo Miao, Xiao-Shan Gao

TL;DR

This work tackles robust overfitting in Wasserstein distributionally robust optimization by introducing Statistically Robust WDRO (SR-WDRO), which augments Wasserstein-based ambiguity with a KL-divergence constraint to account for statistical error from finite data. It provides a rigorous generalization bound showing adversarial test loss is controlled by the statistically robust training loss, and establishes the existence of Stackelberg and Nash equilibria under reasonable conditions. The authors derive a computationally tractable dual reformulation and adapt it to classification with a sample-shift cost, along with a practical training algorithm that preserves computational efficiency. Empirically, SR-WDRO significantly reduces robust overfitting and improves adversarial robustness on CIFAR-10/100 and related architectures, with a manageable increase in training time. Overall, SR-WDRO offers a theoretically grounded, scalable approach to robust distributional learning with practical benefits for unseen distribution shifts.

Abstract

Wasserstein distributionally robust optimization (WDRO) optimizes against worst-case distributional shifts within a specified uncertainty set, leading to enhanced generalization on unseen adversarial examples compared with standard adversarial training, which focuses on pointwise adversarial perturbations. However, WDRO still suffers fundamentally from the robust overfitting problem, as it does not consider statistical error. We address this gap by proposing a novel robust optimization framework, called Statistically Robust WDRO, built on a new uncertainty set that captures adversarial noise via the Wasserstein distance and statistical error via the Kullback-Leibler divergence. We establish a robust generalization bound for the new optimization framework, implying that out-of-distribution adversarial performance is at least as good as the statistically robust training loss with high probability. Furthermore, we derive conditions under which Stackelberg and Nash equilibria exist between the learner and the adversary, giving an optimal robust model in a certain sense. Finally, through extensive experiments, we demonstrate that our method significantly mitigates robust overfitting and enhances robustness within the framework of WDRO.
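
As a reading aid, the two-part uncertainty set described in the abstract can be sketched in LaTeX. This is a hedged reconstruction from the abstract and the Figure 1 caption, not the paper's exact statement: the set name $\mathcal{U}_{\varepsilon,\gamma}$, the loss $\ell$, the model parameter $\theta$, the Wasserstein order $p$, and the direction of the KL divergence are all assumptions made here for illustration.

```latex
% Sketch only: notation and KL direction are assumed, not taken from the paper.
\begin{aligned}
\mathcal{U}_{\varepsilon,\gamma}(\mathcal{D}_n)
  &:= \bigl\{ \mathcal{D}' \;:\; \exists\, \widehat{\mathcal{D}} \ \text{s.t.}\
      W_p(\mathcal{D}_n, \widehat{\mathcal{D}}) \le \varepsilon,\
      \mathrm{KL}(\widehat{\mathcal{D}} \,\|\, \mathcal{D}') \le \gamma \bigr\}, \\
\min_{\theta}\ &\sup_{\mathcal{D}' \in \mathcal{U}_{\varepsilon,\gamma}(\mathcal{D}_n)}
  \mathbb{E}_{z \sim \mathcal{D}'}\bigl[\ell(\theta; z)\bigr].
\end{aligned}
```

Under this sketch, setting $\gamma = 0$ forces $\mathcal{D}' = \widehat{\mathcal{D}}$, so the objective reduces to a standard WDRO problem over a Wasserstein ball of radius $\varepsilon$ around ${\mathcal{D}}_n$.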

Paper Structure

This paper contains 27 sections, 12 theorems, 60 equations, 3 figures, 8 tables, 1 algorithm.

Key Result

Theorem 3

Let ${\mathcal{D}}$ be the true data distribution, and ${\mathcal{D}}_n$ be the observed empirical distribution sampled i.i.d. from ${\mathcal{D}}$. Then for all $\varepsilon >0$, let $\delta = (\frac{\varepsilon}{{\rm{diam}}({\mathcal{Z}})+1})^p$, we have where $m({\mathcal{Z}}, \delta):= \min\{k\ge 0: \exists \xi_1, \cdots, \xi_k \in {\mathcal{Z}}, \ \text{s.t.}\ \cup_{i=1}^k {\mathbb{B}}(\xi_i

Figures (3)

  • Figure 1: Illustration of our SR-WDRO. The adversarial perturbations are quantified using Wasserstein distance between ${\mathcal{D}}_n$ and $\widehat{{\mathcal{D}}}$. The adversarial distribution is then compared with the test distribution ${\mathcal{D}}_{{\rm{test}}}$ using Kullback-Leibler divergence, which accounts for the statistical error.
  • Figure 2: Comparison of SR-WDRO against other robust training methods on CIFAR10 $(\varepsilon= 8/255)$. Left: Robust test accuracy. Right: Robust test loss. Our method (green) demonstrates competitive performance in both metrics, particularly in mitigating robust overfitting and achieving higher robust test accuracy.
  • Figure 3: Impact of statistical error $\gamma$ on mitigating overfitting in our method. Experiments conducted on CIFAR10 with $\varepsilon = 8/255$. Left: Robust test accuracy. Right: Robust test loss.

Theorems & Definitions (30)

  • Definition 1: Wasserstein metric
  • Definition 2: KL divergence
  • Theorem 3: Generalization certificate
  • Remark 4
  • Theorem 5: Robustness certificate
  • Remark 6: Comparison with Standard WDRO
  • Remark 7
  • Proposition 9
  • Theorem 10: Stackelberg Equilibrium
  • Theorem 11: Minimax Theorem
  • ...and 20 more
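
The two quantities named in Definitions 1 and 2 can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes the simplest setting, where the KL divergence is taken between discrete distributions on a shared finite support, and the Wasserstein distance is taken between one-dimensional empirical distributions with equally many, equally weighted samples (for which matching sorted samples is optimal when $p \ge 1$). The function names `kl_divergence` and `wasserstein_1d` are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists
    over the same finite support (illustrative helper, assumed setting)."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same support size")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def wasserstein_1d(xs, ys, p=1):
    """Order-p Wasserstein distance between two 1-D empirical distributions
    with the same number of equally weighted samples. In one dimension the
    optimal coupling for p >= 1 pairs the sorted samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    n = len(xs)
    cost = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / n
    return cost ** (1.0 / p)


if __name__ == "__main__":
    # Shifting every sample by 1 moves the distribution a distance of 1.
    print(wasserstein_1d([0.0, 1.0], [1.0, 2.0]))
    # A point mass versus a fair coin: KL equals log 2.
    print(kl_divergence([1.0, 0.0], [0.5, 0.5]))
```

The contrast is the point of combining the two: the Wasserstein distance measures how far mass must be moved, which suits pixel-level perturbations, while the KL divergence compares how probability mass is reweighted and becomes infinite when the first distribution puts mass where the second has none.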