Multi-level Monte-Carlo Gradient Methods for Stochastic Optimization with Biased Oracles

Yifan Hu; Jie Wang; Xin Chen; Niao He

TL;DR

This work studies stochastic optimization under biased oracles by introducing a unified multi-level Monte Carlo (MLMC) gradient framework. By telescoping gradients across levels, the authors construct several MLMC gradient estimators (V-MLMC, RT-MLMC, RU-MLMC, RR-MLMC) and couple them with SGD and variance-reduction techniques to achieve favorable bias-variance-cost tradeoffs. They provide nonasymptotic total-cost analyses for strongly convex, convex, and nonconvex settings, showing that, under suitable conditions, biased-oracle problems can match the complexity of classical unbiased stochastic optimization. They also improve on the best-known complexities for conditional stochastic optimization, shortfall risk optimization, and related tasks. The theory is complemented by extensive experiments in distributionally robust optimization, pricing/staffing, and contrastive learning, which demonstrate substantial gains in sample efficiency for MLMC gradient methods in practice.

Abstract

We consider stochastic optimization when one only has access to biased stochastic oracles of the objective and the gradient, and obtaining stochastic gradients with low bias comes at a high cost. This setting captures various optimization paradigms, such as conditional stochastic optimization, distributionally robust optimization, and shortfall risk optimization, as well as machine learning paradigms such as contrastive learning. We examine a family of multi-level Monte Carlo (MLMC) gradient methods that exploit a delicate tradeoff among bias, variance, and oracle cost. We systematically study their total sample and computational complexities for strongly convex, convex, and nonconvex objectives and demonstrate their superiority over the widely used biased stochastic gradient method. When combined with variance reduction techniques such as SPIDER, these MLMC gradient methods can further reduce the complexity in the nonconvex regime. Our results imply that a series of stochastic optimization problems with biased oracles, previously considered to be more challenging, are fundamentally no harder than classical stochastic optimization with unbiased oracles. We also delineate the boundary conditions under which these problems become more difficult. Moreover, MLMC gradient methods significantly improve the best-known complexities in the literature for conditional stochastic optimization and shortfall risk optimization. Our extensive numerical experiments on distributionally robust optimization, pricing and staffing scheduling problems, and contrastive learning demonstrate the superior performance of MLMC gradient methods.
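The telescoping idea behind these estimators can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy objective $F(x) = \exp(\mathbb{E}[x + \eta])$ with $\eta \sim N(0,1)$, the level-$l$ oracle that uses $2^l$ inner samples, the simple coupling that reuses the first half of the fine samples, the geometric level distribution, and all function names. With $m$ inner samples, the plug-in gradient has expectation $\exp(x + 1/(2m))$, so lower bias requires more samples per oracle call, which is the bias-cost tradeoff the abstract describes.

```python
import math
import random


def level_gradient(x, noise):
    """Hypothetical level-l oracle: plug-in gradient of F(x) = exp(E[x + eta]).

    Uses the inner samples in `noise`. With m samples and eta ~ N(0, 1),
    its expectation is exp(x + 1/(2m)), so the bias shrinks as m grows.
    """
    inner_mean = sum(x + e for e in noise) / len(noise)
    return math.exp(inner_mean)


def level_difference(x, level, noise):
    """Telescoping term H^l = h^l - h^{l-1}, with h^{-1} taken as 0.

    `noise` must hold 2**level samples. The coarse oracle reuses the first
    half of them so that the two terms are positively correlated.
    """
    fine = level_gradient(x, noise)
    if level == 0:
        return fine
    return fine - level_gradient(x, noise[: 2 ** (level - 1)])


def level_probabilities(max_level, decay=1.5):
    """Assumed geometric level distribution q_l proportional to 2**(-decay*l)."""
    weights = [2.0 ** (-decay * l) for l in range(max_level + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def rt_mlmc_gradient(x, max_level, rng, decay=1.5):
    """One randomized-truncation draw: sample a level, reweight by 1/q_l."""
    probs = level_probabilities(max_level, decay)
    level = rng.choices(range(max_level + 1), weights=probs, k=1)[0]
    noise = [rng.gauss(0.0, 1.0) for _ in range(2 ** level)]
    return level_difference(x, level, noise) / probs[level]
```

Because the level differences telescope, the expectation of one `rt_mlmc_gradient` draw equals the expectation of the most accurate oracle at level `max_level`, while most draws land on cheap low levels. In this sketch the expected number of inner samples per draw is $\sum_l q_l 2^l$, which stays bounded as `max_level` grows whenever `decay > 1`. A draw can be used in place of the stochastic gradient in an SGD update such as `x -= step_size * rt_mlmc_gradient(x, max_level, rng)`, where `step_size` is again a hypothetical name.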

Paper Structure (51 sections, 25 theorems, 137 equations, 5 figures, 11 tables, 2 algorithms)

Introduction
Motivating Examples
Multilevel Monte Carlo (MLMC) Gradient Estimation
Our Contributions
Related Literature
Notations
Organizations
MLMC Gradient Methods
Biased Oracle Setting
SGD with MLMC Gradient Estimators
Variance Reduced Methods with MLMC Gradient Estimators
(Expected) Total Cost Analysis
Total Cost of $L$-SGD
Expected Total Cost of RT-MLMC
Total Cost of V-MLMC
...and 36 more sections

Key Result

Lemma 1

Under Assumption \ref{assumption:general}, for any $x\in\mathbb{R}^d$, the variance and per-iteration cost of the $L$-SGD estimator $v^{L\text{-}\mathrm{SGD}}(x)$ with batch size $n_L$ satisfy

Figures (5)

Figure 1: Comparison results of $L$-SGD, V-MLMC, RT-MLMC, RU-MLMC, and RR-MLMC methods when $f_x(z)$ is convex. The $x$-axes represent the number of generated samples, and $y$-axes represent objective values. The results are averaged with error bars based on $10$ independent runs.
Figure 2: Comparison results of various gradient methods versus their variance reduction counterparts when $f_x(z)$ is nonconvex.
Figure 3: Left: comparison results of $L$-SGD, V-MLMC, RT-MLMC, RU-MLMC, and RR-MLMC methods when $f_x(z)$ is nonconvex; Right: comparison results of variance reduction counterparts of MLMC gradient estimators.
Figure 4: Comparison results of $L$-SGD and VR RT-MLMC methods on contrastive learning with CIFAR-10, CIFAR-100, and SVHN datasets.
Figure 5: Comparison results of $L$-SGD and VR RT-MLMC estimators for the joint pricing and staffing task in a stochastic system with various service time distributions. The results are averaged with error bars based on $10$ independent runs.

Theorems & Definitions (45)

Remark 1: Relationship between Assumptions \ref{assumption:general}\ref{assumption:general:I} and \ref{assumption:gradient_bias}
Remark 2: Intuition on the superiority of MLMC gradient over $L$-SGD
Lemma 1: Variance and Per-Iteration Cost of $L$-SGD
Theorem 1: Total cost of $L$-SGD
Lemma 2: Variance and Per-Iteration Cost of RT-MLMC
Remark 3: Construction of the distribution $Q := \{q_l\}_{l=0}^L$
Theorem 2: Expected Total cost of RT-MLMC
Lemma 3: Variance and Per-Iteration Cost of V-MLMC
Theorem 3: Total cost for V-MLMC when $b\geq c$
Remark 4: Why V-MLMC requires a large mini-batch size
...and 35 more
