Probabilistic Guarantees of Stochastic Recursive Gradient in Non-Convex Finite Sum Problems

Yanjie Zhong; Jiaqi Li; Soumendra Lahiri

TL;DR

This work addresses non-convex finite-sum optimization with Prob-SARAH, a SARAH-based variance-reduced method whose analysis rests on a new dimension-free Azuma–Hoeffding-type bound for martingale difference sequences with random bounds. The authors establish high-probability bounds on the gradient estimator and derive a near-optimal in-probability complexity $Comp(\varepsilon,\delta)=\tilde{O}_{L,\Delta_f,\alpha_M}(1/\varepsilon^3 \wedge \sqrt{n}/\varepsilon^2)$; they also introduce the notion of $\varepsilon$-semi-independence. The key methodological contribution is the new concentration inequality, which enables a rigorous probabilistic analysis of SARAH-style updates in the non-convex setting. Experiments on logistic regression with non-convex regularization and on a two-layer neural network support the theory. The guarantees hold with high probability rather than only in expectation, and the analysis may extend to other SARAH-family algorithms.

Abstract

This paper develops a new dimension-free Azuma-Hoeffding type bound on the norm of the sum of a martingale difference sequence with random individual bounds. With this novel result, we provide high-probability bounds for the gradient norm estimator in the proposed algorithm Prob-SARAH, a modified version of the StochAstic Recursive grAdient algoritHm (SARAH), a state-of-the-art variance-reduced algorithm that achieves optimal computational complexity in expectation for the finite-sum problem. The in-probability complexity of Prob-SARAH matches the best in-expectation result up to logarithmic factors. Empirical experiments demonstrate the superior probabilistic performance of Prob-SARAH on real datasets compared to other popular algorithms.
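For orientation, the recursive gradient estimator that gives SARAH its name can be sketched as follows. This is a minimal sketch of plain SARAH, not of the paper's Prob-SARAH: the batch-size schedule and stopping rule of Prob-SARAH are not given in this summary and are not reproduced. The function name `sarah_epoch`, the single-sample inner step, the fixed step size, and the toy quadratic components are all assumptions made for illustration.

```python
import random

def sarah_epoch(grad_i, n, x0, step, inner_steps, rng):
    """One outer loop of plain SARAH (illustrative; not Prob-SARAH).

    grad_i(i, x) returns the gradient of the i-th component at x as a list.
    """
    x_prev = list(x0)
    # Anchor step: full gradient averaged over all n component functions.
    per_component = [grad_i(i, x_prev) for i in range(n)]
    v = [sum(coords) / n for coords in zip(*per_component)]
    x = [xk - step * vk for xk, vk in zip(x_prev, v)]
    for _ in range(inner_steps):
        i = rng.randrange(n)  # sample one component uniformly at random
        g_new = grad_i(i, x)
        g_old = grad_i(i, x_prev)
        # Recursive estimator: v_t = grad_i(x_t) - grad_i(x_{t-1}) + v_{t-1}
        v = [a - b + c for a, b, c in zip(g_new, g_old, v)]
        x_prev, x = x, [xk - step * vk for xk, vk in zip(x, v)]
    return x

# Hypothetical toy problem: f_i(x) = 0.5 * (x - a_i)^2, minimized at mean(a).
TOY_TARGETS = [1.0, 2.0, 3.0, 4.0]

def toy_grad(i, x):
    return [x[0] - TOY_TARGETS[i]]
```

On the toy problem, `sarah_epoch(toy_grad, len(TOY_TARGETS), [0.0], 0.5, 20, random.Random(0))` moves the iterate toward the mean of the targets, 2.5. The toy components are convex and chosen only so the behavior is easy to check by hand; the paper's setting is non-convex.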

Paper Structure (26 sections, 20 theorems, 130 equations, 3 figures, 1 algorithm)

Introduction
Related Works
Our Contributions
Notation
Prob-SARAH Algorithm
Theoretical Results
Technical Assumptions
Main Results on Complexity
Proof Sketch
Numerical Experiments
Logistic Regression with Non-Convex Regularization
Two-Layer Neural Network
Conclusion
Remarks and Examples for Assumptions
More comments on Assumptions assump:minimumavailable through assump:technical
...and 11 more sections

Key Result

Theorem 3.1

Suppose that Assumptions assump:minimumavailable, assump:smooth, assump:extent and assump:technical hold. Given a pair of errors $(\varepsilon,\delta)$, in Algorithm algo1 (Prob-SARAH), set the hyperparameters [...] for $j\ge1$. Then $Comp(\varepsilon,\delta)=\tilde{O}_{L,\Delta_f,\alpha_M}(1/\varepsilon^3 \wedge \sqrt{n}/\varepsilon^2)$, where $Comp(\varepsilon,\delta)$ represents the number of computations needed to get an output $\hat{\mathbf x}$ satisfying $\left\|\nabla f(\hat{\mathbf x})\right\|$ [...]
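To make the rate $\tilde{O}(1/\varepsilon^3 \wedge \sqrt{n}/\varepsilon^2)$ quoted in the TL;DR concrete, the sketch below evaluates the two competing terms with all constants and logarithmic factors dropped. The function names are hypothetical, and the numbers show only how the two terms scale, not actual computation counts.

```python
import math

def rate_up_to_logs(eps, n):
    """min(1/eps^3, sqrt(n)/eps^2), with constants and log factors dropped."""
    return min(1.0 / eps**3, math.sqrt(n) / eps**2)

def active_term(eps, n):
    """Name the term that attains the minimum."""
    # 1/eps^3 <= sqrt(n)/eps^2 holds exactly when eps >= 1/sqrt(n).
    return "1/eps^3" if eps >= 1.0 / math.sqrt(n) else "sqrt(n)/eps^2"
```

With $n = 10^4$ the crossover sits at $\varepsilon = 1/\sqrt{n} = 0.01$: for a coarser target such as $\varepsilon = 0.1$ the $1/\varepsilon^3$ term is the smaller one, while for $\varepsilon = 0.001$ the $\sqrt{n}/\varepsilon^2$ term takes over.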

Figures (3)

Figure 1: Comparison of convergence with respect to $(1-\delta)$-quantile of square of gradient norm $\left( \|\nabla f\|^2\right)$ and $\delta$-quantile of validation accuracy on the MNIST dataset for $\delta=0.1$ and $\delta=0.01$. The second (fourth) column presents zoom-in figures of those in the first (third) column. Top: $\delta=0.1$. Bottom: $\delta=0.01$. 'bs' stands for batch size. 'sj=x' means that the smallest batch size $\approx x\log x$.
Figure 2: Comparison of convergence with respect to $(1-\delta)$-quantile of square of gradient norm $\left(\|\nabla f\|^2\right)$ over 3 datasets for $\delta=0.1$ and $\delta=0.01$. Top: $\delta=0.1$. Bottom: $\delta=0.01$. Datasets: mushrooms, ijcnn1, w7a (from left to right). 'bs' stands for batch size.
Figure 3: Comparison of convergence with respect to $(1-\delta)$-quantile of square of gradient norm $\left( \|\nabla f\|^2\right)$ and $\delta$-quantile of validation accuracy on the MNIST dataset for $\delta=0.1$ and $\delta=0.01$. The second (fourth) column presents zoom-in figures of those in the first (third) column. Top: $\delta=0.1$. Bottom: $\delta=0.01$. 'bs' stands for batch size. 'sj=x' means that the smallest batch size $\approx x\log x$.
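The captions above report a $(1-\delta)$-quantile of the squared gradient norm. A minimal sketch of one way to compute such a quantile is given below, assuming the quantile is taken over repeated independent runs of an algorithm at a fixed iteration; this summary does not state how the authors computed it. The function name and the ceiling-rank convention are illustrative choices.

```python
import math

def empirical_quantile(values, q):
    """Smallest sample v such that at least a fraction q of the samples are <= v."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    ordered = sorted(values)
    # 1-based rank; the small offset guards against floating-point round-up.
    rank = max(1, math.ceil(q * len(ordered) - 1e-12))
    return ordered[rank - 1]
```

For $\delta = 0.1$ and ten runs, `empirical_quantile(sq_grad_norms, 0.9)` returns the ninth-smallest value, so a curve of this quantity bounds the squared gradient norm in all but the worst run.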

Theorems & Definitions (39)

Definition 3.1: $\varepsilon$-semi-independence
Theorem 3.1
Theorem 3.2: Martingale Azuma-Hoeffding Inequality with Random Bounds
Remark 3.1
Remark A.1: Convexity and smoothness
Remark A.2: Compact set $\mathcal{D}$
Proposition B.1: Stop guarantee of Prob-SARAH
Theorem C.1
Corollary C.1
Theorem C.2
...and 29 more
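
For context on Theorem 3.2, the classical scalar Azuma-Hoeffding inequality with deterministic bounds is recalled below. This is the textbook statement, not the paper's result: according to the abstract, the paper's version bounds the norm of a sum of martingale differences, is dimension-free, and allows random individual bounds, and its exact form is not reproduced in this summary.

```latex
% Classical Azuma--Hoeffding inequality (scalar case, deterministic bounds).
% Let (D_k)_{k=1}^{K} be a real-valued martingale difference sequence with
% |D_k| \le c_k almost surely for deterministic constants c_k. Then, for all t > 0,
\[
  \mathbb{P}\!\left( \left| \sum_{k=1}^{K} D_k \right| \ge t \right)
  \;\le\; 2 \exp\!\left( - \frac{t^{2}}{2 \sum_{k=1}^{K} c_k^{2}} \right).
\]
```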
