Adaptive Variance Reduction for Stochastic Optimization under Weaker Assumptions

Wei Jiang; Sifan Yang; Yibo Wang; Lijun Zhang

TL;DR

A novel adaptive STORM method is introduced that achieves an optimal convergence rate of $\mathcal{O}(T^{-1/3})$ for non-convex functions with the authors' newly designed learning rate strategy.

Abstract

This paper explores adaptive variance reduction methods for stochastic optimization based on the STORM technique. Existing adaptive extensions of STORM rely on strong assumptions like bounded gradients and bounded function values, or suffer an additional $\mathcal{O}(\log T)$ term in the convergence rate. To address these limitations, we introduce a novel adaptive STORM method that achieves an optimal convergence rate of $\mathcal{O}(T^{-1/3})$ for non-convex functions with our newly designed learning rate strategy. Compared with existing approaches, our method requires weaker assumptions and attains the optimal convergence rate without the additional $\mathcal{O}(\log T)$ term. We also extend the proposed technique to stochastic compositional optimization, obtaining the same optimal rate of $\mathcal{O}(T^{-1/3})$. Furthermore, we investigate the non-convex finite-sum problem and develop another innovative adaptive variance reduction method that achieves an optimal convergence rate of $\mathcal{O}(n^{1/4} T^{-1/2} )$, where $n$ represents the number of component functions. Numerical experiments across various tasks validate the effectiveness of our method.
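As a rough illustration of the STORM technique the abstract builds on, the sketch below runs the standard STORM recursion, $v_t = \nabla f(x_t; \xi_t) + (1 - \beta)\,(v_{t-1} - \nabla f(x_{t-1}; \xi_t))$, on a toy objective $f(x) = \frac{1}{2}x^2$ with additive gradient noise. The function names, the toy objective, and the constant step size `lr` and momentum `beta` are assumptions made for this example only. In particular, the constant step size stands in for, and is not, the adaptive learning rate strategy proposed in the paper.

```python
import random


def stochastic_grad(x, noise):
    """Noisy gradient of the toy objective f(x) = 0.5 * x**2 (hypothetical example)."""
    return x + noise


def storm(x0, steps, lr=0.1, beta=0.1, sigma=0.5, seed=0):
    """Run `steps` STORM updates with constant lr and beta (illustrative assumptions)."""
    rng = random.Random(seed)
    x_prev = x0
    # First step: use a plain stochastic gradient as the initial estimate v_1.
    v = stochastic_grad(x_prev, rng.gauss(0.0, sigma))
    x = x_prev - lr * v
    for _ in range(steps - 1):
        # The same sample xi_t is evaluated at both the current and previous iterate.
        noise = rng.gauss(0.0, sigma)
        g_new = stochastic_grad(x, noise)
        g_old = stochastic_grad(x_prev, noise)
        # STORM estimator: new gradient plus a corrected copy of the old estimate.
        v = g_new + (1.0 - beta) * (v - g_old)
        x_prev, x = x, x - lr * v
    return x


if __name__ == "__main__":
    print(storm(5.0, 500))
```

Evaluating one sample at two consecutive iterates is what distinguishes this estimator from plain momentum: the correction term `v - g_old` removes part of the noise carried over from earlier steps without requiring large batches or checkpoints.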


Paper Structure (19 sections, 14 theorems, 116 equations, 3 figures, 1 table, 4 algorithms)

Introduction
Related work
Stochastic variance reduction methods
Adaptive stochastic algorithms
Adaptive variance reduction for non-convex optimization
Assumptions
The proposed method
The doubling trick
Extension to stochastic compositional optimization
Adaptive variance reduction for finite-sum optimization
Experiments
Image classification task
Language modeling task
Conclusion
Proof of Theorem \ref{thm:main_0}
...and 4 more sections

Key Result

Theorem 1

Under Assumptions ass1, ass2 and ass3, Algorithm alg:storm with hyper-parameters in equation (main:eq) guarantees an $\mathcal{O}(T^{-1/3})$ convergence rate for non-convex functions.

Figures (3)

Figure 1: Results for CIFAR-10 dataset.
Figure 2: Results for CIFAR-100 dataset.
Figure 3: Results for WikiText-2 dataset.

Theorems & Definitions (23)

Theorem 1
Theorem 2
Theorem 3
Theorem 4
Theorem 5
Lemma 1
Proof 1
Lemma 2
Proof 2
Lemma 3
...and 13 more
