Drago: Primal-Dual Coupled Variance Reduction for Faster Distributionally Robust Optimization

Ronak Mehta; Jelena Diakonikolas; Zaid Harchaoui

TL;DR

Drago addresses efficiency in penalized distributionally robust optimization by formulating a stochastic primal-dual method that reduces dual variance through a hybrid of randomized and cyclic updates and a novel primal regularization. The algorithm operates on a finite-sum DRO objective with a convex uncertainty set and achieves a linear convergence rate with a complexity depending on the minibatch size, the uncertainty-set size, and smoothness/strong convexity constants. Theoretical results establish a tight overall complexity bound while practical experiments on regression and text classification validate fast convergence and favorable wall-clock performance across varying data sizes and conditioning. This work advances scalable DRO by delivering provable linear convergence with general applicability to common uncertainty sets such as f-divergence balls and spectral risk measures, and demonstrates practical impact on large-scale learning under distribution shift.

Abstract

We consider the penalized distributionally robust optimization (DRO) problem with a closed, convex uncertainty set, a setting that encompasses learning using $f$-DRO and spectral/$L$-risk minimization. We present Drago, a stochastic primal-dual algorithm that combines cyclic and randomized components with a carefully regularized primal update to achieve dual variance reduction. Owing to its design, Drago enjoys a state-of-the-art linear convergence rate on strongly convex-strongly concave DRO problems with a fine-grained dependency on primal and dual condition numbers. Theoretical results are supported by numerical benchmarks on regression and classification tasks.
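For orientation, the penalized problem can be written as a min-max objective. The inner maximization below is taken from the Figure 1 caption. The outer minimization over $w \in \mathscr{W}$ is the usual DRO form and is an assumption here, since the abstract does not display the objective. Any additional primal regularizer the paper may use is omitted.

```latex
\min_{w \in \mathscr{W}} \; \max_{q \in \mathcal{Q}} \;
  \sum_{i=1}^n q_i \ell_i(w) \;-\; \nu D(q \Vert \mathbf{1}/n)
```

Here $\ell_i(w)$ is the loss on the $i$-th example, $\mathcal{Q}$ is the closed, convex uncertainty set, $D$ is the penalty on the dual variable $q$ relative to the uniform weights $\mathbf{1}/n$, and $\nu$ is the dual regularization parameter.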

Paper Structure (55 sections, 19 theorems, 139 equations, 6 figures, 4 tables)

Introduction
Contributions
The Drago Algorithm
Notation & Terminology
Algorithm Description
Computational Complexity
Theoretical Analysis
Convergence Analysis
Experiments
Regression with Large Block Sizes
Results
Text Classification Under Ill-Conditioning
Results
Conclusion
Acknowledgements
...and 40 more sections

Key Result

Theorem 2

For a constant $\alpha > 0$, define the sequence $(a_t)_{t \geq 1}$ [...] along with its partial sums $A_t = \sum_{\tau = 1}^t a_\tau$. Under Assumption asm:main, there is an absolute constant $C$ such that, using the parameter [...], the iterates of Algorithm algo:drago satisfy [...]. Consequently, we can compute a point $(w_T, q_T)$ achieving an expected gap of no more than $\varepsilon$ with big-$O$ complexity [...].

Figures (6)

Figure 1: Visualization of Uncertainty Sets and Penalties. Each plot is a probability simplex in $n=3$ dimensions with the uncertainty set as the colored portion. The black dots are optimal dual variables $q_\nu^\star := \operatorname*{arg\,max}_{q \in {\mathcal{Q}}} \sum_{i=1}^n q_i \ell_i(w) - \nu D(q \Vert \mathbf{1}/n)$ for a fixed $w \in \mathscr{W}$. As $\nu$ decreases, $q_\nu^\star$ may shift toward the boundary of the uncertainty set. The combination of $\nu$ and $D$ determines an "effective" uncertainty set, whose shape is given by the level sets of $D$. Our methods apply to both.
Figure 2: Regression Benchmarks. In both panels, the $y$-axis measures the primal suboptimality gap, defined in \ref{eqn:subopt}. Individual plots correspond to particular datasets. Left: The $x$-axis displays the number of individual first-order oracle queries to $\{(\ell_i, \nabla \ell_i)\}_{i=1}^n$. Right: The $x$-axis displays wall-clock time.
Figure 3: Text Classification Benchmarks. In all plots, the $y$-axis measures the normalized primal (i.e., DRO risk) suboptimality gap, defined in \ref{eqn:subopt}. Columns correspond to different values of the dual regularization parameter $\nu$. In the first three columns, the $x$-axis measures the number of individual first-order oracle queries to $\{(\ell_i, \nabla \ell_i)\}_{i=1}^n$; in the remaining three, it displays wall-clock time. The objective becomes ill-conditioned as $\nu$ decreases.
Figure 4: Replicate of \ref{tab:dro}.
Figure 5: Benchmarks on the $\chi^2$ Uncertainty Set. In both panels, the $y$-axis measures the primal suboptimality gap, defined in \ref{eqn:subopt}. Individual plots correspond to particular datasets. Left: The $x$-axis displays the number of individual first-order oracle queries to $\{(\ell_i, \nabla \ell_i)\}_{i=1}^n$. Right: The $x$-axis displays wall-clock time.
...and 1 more figure
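The Figure 1 caption defines the optimal dual variable $q_\nu^\star$ as the maximizer of $\sum_i q_i \ell_i(w) - \nu D(q \Vert \mathbf{1}/n)$. The sketch below illustrates how $\nu$ moves that maximizer in one special case, chosen here for illustration and not taken from the paper: $D$ is the KL divergence and $\mathcal{Q}$ is the entire probability simplex. In that case the maximizer is the softmax of $\ell/\nu$. The function names `kl_penalized_dual` and `penalized_objective` are hypothetical.

```python
import math
from typing import List


def kl_penalized_dual(losses: List[float], nu: float) -> List[float]:
    """Maximizer of sum_i q_i * l_i - nu * KL(q || uniform) over the full simplex.

    Assumption: D is the KL divergence and the uncertainty set is the whole
    simplex, so the maximizer is softmax(losses / nu).
    """
    if nu <= 0:
        raise ValueError("nu must be positive")
    shift = max(losses)  # subtract the max for numerical stability
    weights = [math.exp((l - shift) / nu) for l in losses]
    total = sum(weights)
    return [w / total for w in weights]


def penalized_objective(losses: List[float], q: List[float], nu: float) -> float:
    """Value of sum_i q_i * l_i - nu * KL(q || uniform) at a given q."""
    n = len(losses)
    # Terms with q_i == 0 contribute 0 to the KL divergence by convention.
    kl = sum(qi * math.log(qi * n) for qi in q if qi > 0)
    return sum(qi * li for qi, li in zip(q, losses)) - nu * kl


if __name__ == "__main__":
    losses = [0.5, 1.0, 2.0]  # fixed per-example losses l_i(w) for some w
    for nu in (10.0, 1.0, 0.1):
        q = kl_penalized_dual(losses, nu)
        print(nu, [round(qi, 4) for qi in q], round(penalized_objective(losses, q, nu), 4))
```

With a large $\nu$ the weights stay close to uniform. As $\nu$ shrinks, mass concentrates on the example with the largest loss, which matches the caption's remark that $q_\nu^\star$ may shift toward the boundary. For other penalties or constrained uncertainty sets the maximizer generally has no such closed form.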

Theorems & Definitions (34)

Theorem 2
Proposition 3
proof
Lemma 4: Cross Term Bound
proof
Lemma 5: Primal Noise Bound
proof
Lemma 6: Dual Noise Bound
proof
Lemma 7
...and 24 more
