Accelerated Parameter-Free Stochastic Optimization

Itai Kreisler; Maor Ivgi; Oliver Hinder; Yair Carmon

TL;DR

This work addresses accelerated stochastic optimization in the smooth convex setting without requiring exact problem parameters. It introduces U-DoG, a parameter-free accelerated method that combines UniXGrad and DoG with iterate stabilization, using the evolving drift $\bar{r}_t$ to adapt momentum and ensure stability. The authors prove near-optimal high-probability rates under sub-Gaussian noise. The analysis covers both the noiseless and stochastic cases, providing general suboptimality bounds, stability guarantees, results for bounded and sub-Gaussian noise, a mini-batch corollary, and a discussion of the method's parameter-free nature. Empirically, U-DoG and its variant A-DoG improve over DoG on convex problems and are competitive with carefully tuned SGD, while neural network experiments show mixed results. The method's strength is therefore parameter-free acceleration for convex stochastic optimization; non-convex deep learning remains a current limitation.

Abstract

We propose a method that achieves near-optimal rates for smooth stochastic convex optimization and requires essentially no prior knowledge of problem parameters. This improves on prior work, which requires knowing at least the initial distance to optimality $d_0$. Our method, U-DoG, combines UniXGrad (Kavis et al., 2019) and DoG (Ivgi et al., 2023) with novel iterate stabilization techniques. It requires only loose bounds on $d_0$ and the noise magnitude, provides high-probability guarantees under sub-Gaussian noise, and is also near-optimal in the non-smooth case. Our experiments show consistent, strong performance on convex problems and mixed results on neural network training.
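
For orientation, here is a minimal sketch of the plain DoG step size (Ivgi et al., 2023) that U-DoG builds on: the step is the largest distance travelled from the starting point, $\bar{r}_t$, divided by the root of the accumulated squared gradient norms. This is not U-DoG itself; it has no UniXGrad-style momentum and no iterate stabilization. The function name `dog_sgd`, the parameter name `r_eps`, and the toy quadratic objective are assumptions made for illustration only.

```python
import numpy as np


def dog_sgd(grad, x0, steps, r_eps=1e-6):
    """Plain DoG sketch (not U-DoG): SGD with
    eta_t = rbar_t / sqrt(sum_{i<=t} ||g_i||^2),
    where rbar_t = max_{i<=t} ||x_i - x0||."""
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    # Small initial "movement" so the first step is nonzero (assumed default).
    rbar = r_eps * (1.0 + np.linalg.norm(x0))
    grad_sq_sum = 0.0
    for _ in range(steps):
        g = np.asarray(grad(x), dtype=float)
        grad_sq_sum += float(g @ g)
        if grad_sq_sum == 0.0:
            break  # all gradients so far are zero: already at a stationary point
        # Track the largest distance from the initial point seen so far.
        rbar = max(rbar, float(np.linalg.norm(x - x0)))
        eta = rbar / np.sqrt(grad_sq_sum)
        x = x - eta * g
    return x, rbar


# Toy noiseless problem (illustrative): f(x) = 0.5 * ||x - target||^2.
target = np.array([3.0, -2.0])
x_final, rbar_final = dog_sgd(lambda x: x - target, x0=np.zeros(2), steps=2000)
```

No learning rate is passed in: the step size grows from the tiny `r_eps` scale until $\bar{r}_t$ is comparable to the distance to the minimizer, then shrinks as gradient norms accumulate.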

Paper Structure (64 sections, 26 theorems, 229 equations, 19 figures, 1 table)

Introduction
Our contribution.
Related work
Non-smooth stochastic optimization.
Non-stochastic smooth optimization.
Smooth stochastic optimization.
Preliminaries and algorithmic framework
Basic notation and conventions.
Presenting U-DoG.
UniXGrad as a special case.
Analysis in the noiseless case
General suboptimality bound
Iterate stability
Rate of convergence in the noiseless case
Analysis in the stochastic case
...and 49 more sections

Key Result

Proposition 1

In the noiseless setting (ass:noiseless), suppose the U-DoG step sizes eq:step-size-form satisfy $G_{x,t} \ge Q_{t-1}$ for all $t\ge 0$. Then for every $t\ge0$ and for any number $s\ge0$, we have

Figures (19)

Figure 1: Training a linear model with ViT-B/32 features and least-squares loss on SVHN. Top: Train loss. Bottom: Test accuracy after iterate averaging. First column: Batch size scaling of complexity to reach target performance. Second column: Learning curves. Third column: ASGD performance at all learning rates and momenta, contrasted with DoG variants.
Figure 2: Training a linear model with ViT-B/32 features and log loss on SVHN. Top: Train loss. Bottom: Test accuracy after iterate averaging. First column: Batch size scaling of complexity to reach target performance. Second column: Learning curves. Third column: ASGD performance at all learning rates and momenta, contrasted with DoG variants.
Figure 4: Training a linear model with ViT-B/32 features and log loss on CIFAR-100. Top: Train loss. Bottom: Test accuracy after iterate averaging. First column: Batch size scaling of complexity to reach target performance. Second column: Learning curves. Third column: ASGD performance at all learning rates and momenta, contrasted with DoG variants.
Figure 6: Training a linear model with ViT-B/32 features and log loss on DMLab. Top: Train loss. Bottom: Test accuracy after iterate averaging. First column: Batch size scaling of complexity to reach target performance. Second column: Learning curves. Third column: ASGD performance at all learning rates and momenta, contrasted with DoG variants.
Figure 8: Training a linear model with ViT-B/32 features and log loss on Resisc45. Top: Train loss. Bottom: Test accuracy after iterate averaging. First column: Batch size scaling of complexity to reach target performance. Second column: Learning curves. Third column: ASGD performance at all learning rates and momenta, contrasted with DoG variants.
...and 14 more figures

Theorems & Definitions (47)

Proposition 1
Proposition 2
Theorem 1
Proposition 3
Proposition 4
Theorem 2
Corollary 1
Corollary 2
proof
proof
...and 37 more
