Drago: Primal-Dual Coupled Variance Reduction for Faster Distributionally Robust Optimization

Ronak Mehta, Jelena Diakonikolas, Zaid Harchaoui

TL;DR

Drago addresses efficiency in penalized distributionally robust optimization by formulating a stochastic primal-dual method that reduces dual variance through a hybrid of randomized and cyclic updates and a novel primal regularization. The algorithm operates on a finite-sum DRO objective with a convex uncertainty set and achieves a linear convergence rate with a complexity depending on the minibatch size, the uncertainty-set size, and smoothness/strong convexity constants. Theoretical results establish a tight overall complexity bound while practical experiments on regression and text classification validate fast convergence and favorable wall-clock performance across varying data sizes and conditioning. This work advances scalable DRO by delivering provable linear convergence with general applicability to common uncertainty sets such as f-divergence balls and spectral risk measures, and demonstrates practical impact on large-scale learning under distribution shift.

Abstract

We consider the penalized distributionally robust optimization (DRO) problem with a closed, convex uncertainty set, a setting that encompasses learning using $f$-DRO and spectral/$L$-risk minimization. We present Drago, a stochastic primal-dual algorithm that combines cyclic and randomized components with a carefully regularized primal update to achieve dual variance reduction. Owing to its design, Drago enjoys a state-of-the-art linear convergence rate on strongly convex-strongly concave DRO problems with a fine-grained dependency on primal and dual condition numbers. Theoretical results are supported by numerical benchmarks on regression and classification tasks.
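For orientation, the penalized DRO problem described above can be written as a saddle-point objective. The block below is a sketch assembled from the notation in the Figure 1 caption ($q$, $\mathcal{Q}$, $\ell_i$, $\nu$, $D$, $\mathscr{W}$). The quadratic primal term with coefficient $\mu$ is an assumption, included only to illustrate the strongly convex-strongly concave setting that the abstract mentions; it is not taken from this page.

```latex
\min_{w \in \mathscr{W}} \; \max_{q \in \mathcal{Q}} \;
  \sum_{i=1}^n q_i \ell_i(w)
  \;-\; \nu D(q \Vert \mathbf{1}/n)
  \;+\; \frac{\mu}{2} \lVert w \rVert_2^2
  % the \mu-term is an assumed primal regularizer, shown for illustration only
```

Here $\nu > 0$ controls the strength of the dual penalty: a larger $\nu$ pulls the maximizing $q$ toward the uniform weights $\mathbf{1}/n$, while a smaller $\nu$ lets it concentrate on the examples with the largest losses.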

Paper Structure (55 sections, 19 theorems, 139 equations, 6 figures, 4 tables)

This paper contains 55 sections, 19 theorems, 139 equations, 6 figures, 4 tables.

Key Result

Theorem 2

For a constant $\alpha > 0$, define the sequence $(a_t)_{t \geq 1}$ along with its partial sum $A_t = \sum_{\tau = 1}^t a_\tau$. Under Assumption \ref{asm:main}, there is an absolute constant $C$ such that, using the parameter …, the iterates of Algorithm \ref{algo:drago} satisfy … We can compute a point $(w_T, q_T)$ achieving an expected gap no more than $\varepsilon$ with big-$O$ complexity …

Figures (6)

  • Figure 1: Visualization of Uncertainty Sets and Penalties. Each plot is a probability simplex in $n=3$ dimensions with the uncertainty set as the colored portion. The black dots are optimal dual variables $q_\nu^\star := \operatorname*{arg\,max}_{q \in {\mathcal{Q}}} \sum_{i=1}^n q_i \ell_i(w) - \nu D(q \Vert \mathbf{1}/n)$ for a fixed $w \in \mathscr{W}$. As $\nu$ decreases, $q_\nu^\star$ may shift toward the boundary of the uncertainty set. The combination of $\nu$ and $D$ determines an "effective" uncertainty set, whose shape is given by the level sets of $D$. Our methods apply to both.
  • Figure 2: Regression Benchmarks. In both panels, the $y$-axis measures the primal suboptimality gap defined in \ref{eqn:subopt}. Individual plots correspond to particular datasets. Left: The $x$-axis displays the number of individual first-order oracle queries to $\{(\ell_i, \nabla \ell_i)\}_{i=1}^n$. Right: The $x$-axis displays wall-clock time.
  • Figure 3: Text Classification Benchmarks. In all plots, the $y$-axis measures the normalized primal (i.e., DRO risk) suboptimality gap, defined in \ref{eqn:subopt}. Columns correspond to different values of the dual regularization parameter $\nu$. In the first three columns, the $x$-axis displays the number of individual first-order oracle queries to $\{(\ell_i, \nabla \ell_i)\}_{i=1}^n$; in the remaining three, it displays wall-clock time. The objective becomes ill-conditioned as $\nu$ decreases.
  • Figure 4: Replicate of \ref{tab:dro}.
  • Figure 5: Benchmarks on the $\chi^2$ Uncertainty Set. In both panels, the $y$-axis measures the primal suboptimality gap, defined in \ref{eqn:subopt}. Individual plots correspond to particular datasets. Left: The $x$-axis displays the number of individual first-order oracle queries to $\{(\ell_i, \nabla \ell_i)\}_{i=1}^n$. Right: The $x$-axis displays wall-clock time.
  • ...and 1 more figure
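
The effect of $\nu$ on the optimal dual variables $q_\nu^\star$ described in the Figure 1 caption can be illustrated in a special case. The sketch below assumes that $\mathcal{Q}$ is the full probability simplex and that $D$ is the KL divergence; both choices are made here for illustration and are not taken from the paper's experiments. Under those assumptions the maximizer has the closed form $q_i \propto \exp(\ell_i / \nu)$. The function names `kl_penalized_dual` and `penalized_objective` are hypothetical.

```python
import math


def kl_penalized_dual(losses, nu):
    """Maximizer of sum_i q_i * l_i - nu * KL(q || uniform) over the full simplex.

    Assumes Q is the whole simplex and D is the KL divergence, in which case
    the solution is a softmax of the losses at temperature nu.
    """
    if nu <= 0:
        raise ValueError("nu must be positive")
    scaled = [loss / nu for loss in losses]
    shift = max(scaled)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def penalized_objective(q, losses, nu):
    """Value of sum_i q_i * l_i - nu * KL(q || uniform) at a given q."""
    n = len(losses)
    risk = sum(qi * li for qi, li in zip(q, losses))
    kl = sum(qi * math.log(n * qi) for qi in q if qi > 0)
    return risk - nu * kl


# Fixed losses l_i(w) at some w, with n = 3 as in Figure 1.
losses = [0.2, 1.0, 3.0]
for nu in (10.0, 1.0, 0.1):
    q = kl_penalized_dual(losses, nu)
    print(f"nu={nu}: q={[round(qi, 3) for qi in q]}")
```

With a large $\nu$ the weights stay close to uniform. As $\nu$ shrinks, the mass moves toward the example with the largest loss, which in this unconstrained special case is a vertex of the simplex.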

Theorems & Definitions (34)

  • Theorem 2
  • Proposition 3
  • proof
  • Lemma 4: Cross Term Bound
  • proof
  • Lemma 5: Primal Noise Bound
  • proof
  • Lemma 6: Dual Noise Bound
  • proof
  • Lemma 7
  • ...and 24 more