Probabilistic Guarantees of Stochastic Recursive Gradient in Non-Convex Finite Sum Problems

Yanjie Zhong, Jiaqi Li, Soumendra Lahiri

TL;DR

This work addresses non-convex finite-sum optimization by introducing Prob-SARAH, a SARAH-based variance-reduced method, together with a novel dimension-free Azuma–Hoeffding-type bound for martingale differences with random bounds. The authors establish high-probability bounds on the gradient estimator, derive a near-optimal in-probability complexity $Comp(\varepsilon,\delta)=\tilde{O}_{L,\Delta_f,\alpha_M}\left(1/\varepsilon^3 \wedge \sqrt{n}/\varepsilon^2\right)$, and introduce the notion of $\varepsilon$-semi-independence. The key methodological contribution is the new concentration inequality, which enables a rigorous probabilistic analysis of SARAH-style updates in the non-convex setting. Experiments on logistic regression with non-convex regularization and on a two-layer neural network support the analysis. In practice, the results give high-probability guarantees for non-convex finite-sum problems, and the analysis may extend to other SARAH-family algorithms.

Abstract

This paper develops a new dimension-free Azuma-Hoeffding type bound on the norm of the sum of a martingale difference sequence with random individual bounds. With this novel result, we provide high-probability bounds for the gradient norm estimator in the proposed algorithm Prob-SARAH, which is a modified version of the StochAstic Recursive grAdient algoritHm (SARAH), a state-of-the-art variance-reduced algorithm that achieves optimal computational complexity in expectation for the finite-sum problem. The in-probability complexity of Prob-SARAH matches the best in-expectation result up to logarithmic factors. Empirical experiments demonstrate the superior probabilistic performance of Prob-SARAH on real datasets compared to other popular algorithms.
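
The recursive gradient estimator that gives SARAH its name can be sketched as follows. This is a minimal illustration of the standard SARAH recursion on a toy least-squares problem, not the authors' Prob-SARAH: the batch-size schedule and stopping rule of Prob-SARAH are not reproduced, the toy objective is convex, and the names `full_gradient`, `sarah_epoch`, and `grad_i` as well as all parameter values are assumptions made for this example.

```python
import math
import random

def full_gradient(grad_i, n, x):
    """Average of the n component gradients at x."""
    g = [0.0] * len(x)
    for i in range(n):
        gi = grad_i(i, x)
        for k in range(len(x)):
            g[k] += gi[k] / n
    return g

def sarah_epoch(grad_i, n, x0, step, inner_steps, batch_size, rng):
    """One outer iteration of a plain SARAH-style recursion (hypothetical helper).

    Prob-SARAH's batch-size schedule and stopping rule are NOT reproduced here.
    """
    x_prev = list(x0)
    v = full_gradient(grad_i, n, x_prev)  # v_0 = grad f(x_0)
    x = [xk - step * vk for xk, vk in zip(x_prev, v)]
    for _ in range(inner_steps):
        batch = [rng.randrange(n) for _ in range(batch_size)]
        # v_t = v_{t-1} + (1/|B|) * sum_{i in B} [grad f_i(x_t) - grad f_i(x_{t-1})]
        for i in batch:
            g_new, g_old = grad_i(i, x), grad_i(i, x_prev)
            v = [vk + (gn - go) / batch_size for vk, gn, go in zip(v, g_new, g_old)]
        # Step along the updated estimator and shift the iterates.
        x_prev, x = x, [xk - step * vk for xk, vk in zip(x, v)]
    return x, v

# Toy finite-sum problem (an assumption for illustration):
# f_i(x) = 0.5 * (a_i . x - b_i)^2 with consistent data b_i = a_i . x_true.
rng = random.Random(0)
n, d = 20, 2
A = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
x_true = [1.0, -2.0]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]

def grad_i(i, x):
    """Gradient of the i-th toy component f_i at x."""
    r = sum(a * xk for a, xk in zip(A[i], x)) - b[i]
    return [r * a for a in A[i]]

x = [0.0] * d
for _ in range(20):
    x, v = sarah_epoch(grad_i, n, x, step=0.1, inner_steps=10, batch_size=4, rng=rng)
grad_norm = math.sqrt(sum(g * g for g in full_gradient(grad_i, n, x)))
```

Each outer iteration recomputes the full gradient once and then updates the estimator from mini-batch gradient differences only, which is where the variance reduction comes from. The final `grad_norm` is the quantity whose tail behavior the paper's high-probability analysis targets.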

Paper Structure

This paper contains 26 sections, 20 theorems, 130 equations, 3 figures, 1 algorithm.

Key Result

Theorem 3.1

Suppose that Assumptions assump:minimumavailable, assump:smooth, assump:extent and assump:technical hold. Given a pair of errors $(\varepsilon,\delta)$, in Algorithm 1 (Prob-SARAH), set the hyperparameters […] for $j\ge1$, where […]. Then $Comp(\varepsilon,\delta)=\tilde{O}_{L,\Delta_f,\alpha_M}\left(1/\varepsilon^3 \wedge \sqrt{n}/\varepsilon^2\right)$, where $Comp(\varepsilon,\delta)$ represents the number of computations needed to get an output $\hat{\mathbf x}$ satisfying $\left\|\nabla f(\hat{\mathbf x})\right\|$ …

Figures (3)

  • Figure 1: Convergence comparison in terms of the $(1-\delta)$-quantile of the squared gradient norm $\|\nabla f\|^2$ and the $\delta$-quantile of validation accuracy on the MNIST dataset, for $\delta=0.1$ and $\delta=0.01$. The second and fourth columns show zoomed-in versions of the first and third columns, respectively. Top: $\delta=0.1$. Bottom: $\delta=0.01$. 'bs' stands for batch size. 'sj=x' means that the smallest batch size is approximately $x\log x$.
  • Figure 2: Convergence comparison in terms of the $(1-\delta)$-quantile of the squared gradient norm $\|\nabla f\|^2$ on 3 datasets, for $\delta=0.1$ and $\delta=0.01$. Top: $\delta=0.1$. Bottom: $\delta=0.01$. Datasets, from left to right: mushrooms, ijcnn1, w7a. 'bs' stands for batch size.
  • Figure 3: Convergence comparison in terms of the $(1-\delta)$-quantile of the squared gradient norm $\|\nabla f\|^2$ and the $\delta$-quantile of validation accuracy on the MNIST dataset, for $\delta=0.1$ and $\delta=0.01$. The second and fourth columns show zoomed-in versions of the first and third columns, respectively. Top: $\delta=0.1$. Bottom: $\delta=0.01$. 'bs' stands for batch size. 'sj=x' means that the smallest batch size is approximately $x\log x$.
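
The quantile curves in the figure captions summarize many independent runs at each checkpoint. A minimal sketch of how such a curve could be computed is given below. The quantile convention used here (smallest sample value with at least a fraction $q$ of the sample at or below it) is an assumption, since the captions do not state which one the authors use, and `empirical_quantile` and `quantile_curve` are hypothetical helper names.

```python
import math

def empirical_quantile(values, q):
    """Smallest sample value with at least a fraction q of the sample <= it."""
    ordered = sorted(values)
    k = max(1, math.ceil(q * len(ordered)))
    return ordered[k - 1]

def quantile_curve(runs, q):
    """runs[r][t] is the metric of run r at checkpoint t.

    Returns the q-quantile across runs for every checkpoint.
    """
    return [empirical_quantile(column, q) for column in zip(*runs)]

# Made-up squared gradient norms: 4 runs, 2 checkpoints each.
runs = [
    [4.0, 1.0],
    [2.0, 3.0],
    [1.0, 2.0],
    [3.0, 4.0],
]
upper_curve = quantile_curve(runs, 0.75)
```

For the squared gradient norm one would pass `q = 1 - delta`, so that the curve bounds all but the worst fraction $\delta$ of runs. For validation accuracy one would pass `q = delta`, since low accuracy is the bad tail.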

Theorems & Definitions (39)

  • Definition 3.1: $\varepsilon$-semi-independence
  • Theorem 3.1
  • Theorem 3.2: Martingale Azuma-Hoeffding Inequality with Random Bounds
  • Remark 3.1
  • Remark A.1: Convexity and smoothness
  • Remark A.2: Compact set $\mathcal{D}$
  • Proposition B.1: Stop guarantee of Prob-SARAH
  • Theorem C.1
  • Corollary C.1
  • Theorem C.2
  • ...and 29 more
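
For context on Theorem 3.2, the classical scalar Azuma-Hoeffding inequality is recalled below. This is the textbook statement with deterministic bounds $c_k$, not the paper's result. According to the abstract, the paper's version instead controls the norm of a vector-valued sum, is dimension-free, and allows the individual bounds to be random.

```latex
% Classical (scalar) Azuma-Hoeffding inequality, stated for reference only.
% (X_k)_{k=1}^{K}: martingale difference sequence with |X_k| <= c_k almost surely,
% where c_1, ..., c_K are deterministic constants.
\[
  \mathbb{P}\left( \left| \sum_{k=1}^{K} X_k \right| \ge t \right)
  \le 2 \exp\left( - \frac{t^2}{2 \sum_{k=1}^{K} c_k^2} \right),
  \qquad t > 0 .
\]
```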