Large-Scale Non-convex Stochastic Constrained Distributionally Robust Optimization

Qi Zhang, Yi Zhou, Ashley Prater-Bennette, Lixin Shen, Shaofeng Zou

TL;DR

This work tackles large-scale constrained distributionally robust optimization with non-convex losses, focusing on the general Cressie-Read divergence family. It develops a stochastic algorithm (SFK-DRO) whose per-iteration cost is independent of the dataset size. The algorithm relies on a dual formulation and a smooth, Lipschitz approximation, and it combines stochastic gradient updates for the decision variable with Frank-Wolfe updates for the dual parameters. The authors prove convergence to an $ε$-stationary point with a computational complexity of $O(ε^{-3k_* -5})$ through careful bias control and variance analysis, and they show that the method extends to smoothed CVaR DRO. Empirical results on imbalanced CIFAR-10 show faster convergence and improved robustness compared with baselines, which supports the method's practical value for large-scale, non-convex DRO tasks under distribution shift.

Abstract

Distributionally robust optimization (DRO) is a powerful framework for training models that are robust against data distribution shifts. This paper focuses on constrained DRO, which has an explicit characterization of the robustness level. Existing studies on constrained DRO mostly focus on convex loss functions and exclude the practical and challenging case of non-convex loss functions, e.g., neural networks. This paper develops a stochastic algorithm, together with its performance analysis, for non-convex constrained DRO. The computational complexity of our stochastic algorithm at each iteration is independent of the overall dataset size, and it is thus suitable for large-scale applications. We focus on uncertainty sets defined by the general Cressie-Read family of divergences, which includes the $χ^2$-divergence as a special case. We prove that our algorithm finds an $ε$-stationary point with a computational complexity of $\mathcal O(ε^{-3k_*-5})$, where $k_*$ is the parameter of the Cressie-Read divergence. The numerical results indicate that our method outperforms existing methods. Our method also applies to the smoothed conditional value at risk (CVaR) DRO.
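
To make the statement that the Cressie-Read family contains the $χ^2$-divergence concrete, the following minimal sketch evaluates the divergence between two discrete distributions. It assumes the standard parameterization $f_k(t) = (t^k - kt + k - 1)/(k(k-1))$ with $D_{f_k}(Q\|P) = \sum_i p_i f_k(q_i/p_i)$, which is common in the DRO literature but is not spelled out in the abstract itself. The function names and the example distributions are hypothetical and chosen only for illustration.

```python
import numpy as np


def cressie_read_f(t, k):
    """Assumed generator f_k(t) = (t^k - k*t + k - 1) / (k*(k - 1)), k not in {0, 1}."""
    t = np.asarray(t, dtype=float)
    return (t**k - k * t + k - 1.0) / (k * (k - 1.0))


def cressie_read_divergence(q, p, k):
    """D_{f_k}(Q || P) = sum_i p_i * f_k(q_i / p_i) for discrete Q, P with p_i > 0."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * cressie_read_f(q / p, k)))


# Hypothetical nominal distribution P and shifted distribution Q.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# For k = 2 the generator reduces to (t - 1)^2 / 2, i.e. half the chi-square divergence.
d2 = cressie_read_divergence(q, p, k=2.0)
half_chi2 = 0.5 * float(np.sum((q - p) ** 2 / p))

# Other members of the family, e.g. k = 1.5, are evaluated the same way.
d15 = cressie_read_divergence(q, p, k=1.5)
```

Under this parameterization, setting `k=2.0` gives a value that matches `half_chi2` up to floating-point error, which is the sense in which the $χ^2$-divergence is a special case. A constrained DRO uncertainty set would then consist of all `q` whose divergence from `p` is at most a chosen radius.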

Paper Structure (27 sections, 5 theorems, 103 equations, 1 figure, 1 table, 1 algorithm)

Key Result

Lemma 1

$\forall x\in \mathbb R^d, 0\le \lambda_0\le \bar{\lambda}$,

Figures (1)

  • Figure 1: Training curve of classification task.

Theorems & Definitions (10)

  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Lemma 3
  • Lemma 4
  • proof
  • proof
  • proof
  • proof
  • proof