High Probability Guarantees for Random Reshuffling

Hengxu Yu; Xiao Li

TL;DR

The paper studies random reshuffling (RR), the stochastic gradient method that samples without replacement, for smooth nonconvex finite-sum optimization. It develops a concentration framework for sampling without replacement and uses it to derive high-probability first- and second-order guarantees for RR. Along the way it introduces RR-sc, which equips RR with a computable stopping criterion, and p-RR, a perturbed variant that escapes strict saddle points. The main results are a first-order complexity bound of $\tilde{O}(\max\{\sqrt{n}\varepsilon^{-3}, n\varepsilon^{-2}\})$ and a second-order bound of $\tilde{O}(\max\{\sqrt{n}\varepsilon^{-3}, n\varepsilon^{-5/2}\})$ under standard smoothness and Hessian Lipschitz assumptions, neither of which requires sub-Gaussian gradient errors. Numerical experiments on neural network training support the theory and illustrate the stopping behavior and gradient concentration in practice. The concentration tool may be of independent interest for RR-type analyses in other settings.
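To see which term inside each $\max$ dominates, it may help to compare the two terms directly. The short derivation below is only arithmetic on the stated bounds (logarithmic factors hidden in $\tilde{O}$ are ignored), and the value $n = 10^4$ is an illustrative choice, not a figure from the paper.

$$
\sqrt{n}\,\varepsilon^{-3} \ge n\,\varepsilon^{-2}
\iff \varepsilon^{-1} \ge \sqrt{n}
\iff \varepsilon \le n^{-1/2},
$$

$$
\sqrt{n}\,\varepsilon^{-3} \ge n\,\varepsilon^{-5/2}
\iff \varepsilon^{-1/2} \ge \sqrt{n}
\iff \varepsilon \le n^{-1}.
$$

So the $\sqrt{n}\,\varepsilon^{-3}$ term governs the first-order bound once $\varepsilon \le n^{-1/2}$, and the second-order bound once $\varepsilon \le n^{-1}$. For example, with $n = 10^4$ the crossover accuracies are $\varepsilon = 10^{-2}$ and $\varepsilon = 10^{-4}$ respectively.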

Abstract

We consider the stochastic gradient method with random reshuffling ($\mathsf{RR}$) for tackling smooth nonconvex optimization problems. $\mathsf{RR}$ finds broad applications in practice, notably in training neural networks. In this work, we provide high probability first-order and second-order complexity guarantees for this method. First, we establish a high probability first-order sample complexity result for driving the Euclidean norm of the gradient (without taking expectation) below $\varepsilon$. Our derived complexity matches the best existing in-expectation one up to a logarithmic term while imposing no additional assumptions nor changing $\mathsf{RR}$'s updating rule. We then propose a simple and computable stopping criterion for $\mathsf{RR}$ (denoted as $\mathsf{RR}$-$\mathsf{sc}$). This criterion is guaranteed to be triggered after a finite number of iterations, enabling us to prove a high probability first-order complexity guarantee for the last iterate. Second, building on the proposed stopping criterion, we design a perturbed random reshuffling method ($\mathsf{p}$-$\mathsf{RR}$) that involves an additional randomized perturbation procedure near stationary points. We derive that $\mathsf{p}$-$\mathsf{RR}$ provably escapes strict saddle points and establish a high probability second-order complexity result, without requiring any sub-Gaussian tail-type assumptions on the stochastic gradient errors. The fundamental ingredient in deriving the aforementioned results is the new concentration property for sampling without replacement in $\mathsf{RR}$, which could be of independent interest. Finally, we conduct numerical experiments on neural network training to support our theoretical findings.
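As a rough illustration of the method described above, the following is a minimal sketch of random reshuffling with an epoch-level stopping test. The stopping rule used here (the norm of the averaged within-epoch stochastic gradients falling below `tol`) is an assumption made for illustration and is not necessarily the criterion the paper defines for $\mathsf{RR}$-$\mathsf{sc}$. The function name `rr_with_stopping` and all of its parameters are hypothetical.

```python
import numpy as np

def rr_with_stopping(grads, x0, step, tol, max_epochs, seed=0):
    """Random reshuffling on f(x) = (1/n) * sum_i f_i(x).

    grads[i](x) returns the gradient of the i-th component f_i at x.
    Stopping rule (illustrative assumption, not the paper's exact RR-sc):
    stop once the average of the gradients seen during an epoch has
    Euclidean norm at most tol.
    """
    rng = np.random.default_rng(seed)
    n = len(grads)
    x = np.asarray(x0, dtype=float)
    for epoch in range(max_epochs):
        perm = rng.permutation(n)      # fresh permutation: sampling without replacement
        g_sum = np.zeros_like(x)
        for i in perm:                 # one full pass over all n components
            g = grads[i](x)
            g_sum += g
            x = x - step * g
        g_epoch = g_sum / n            # computable quantity, no full gradient needed
        if np.linalg.norm(g_epoch) <= tol:
            return x, epoch + 1, True
    return x, max_epochs, False
```

A toy usage, on a problem where every component shares the same minimizer `a` (chosen so that a constant step size converges; this is far simpler than the nonconvex setting the paper analyzes):

```python
a = np.array([1.0, -2.0, 3.0])
cs = np.linspace(0.5, 1.5, 10)
grads = [lambda x, c=c: c * (x - a) for c in cs]  # gradient of 0.5 * c * ||x - a||^2
x, epochs, stopped = rr_with_stopping(grads, np.zeros(3), step=0.1, tol=1e-6, max_epochs=200)
```

Note that `g_epoch` equals `(x_start - x_end) / (n * step)` for the epoch, so the test costs nothing beyond the updates RR already performs.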

Paper Structure (17 sections, 18 theorems, 88 equations, 5 figures, 3 algorithms)

This paper contains 17 sections, 18 theorems, 88 equations, 5 figures, 3 algorithms.

Introduction
Our Results
Prior Arts
Concentration Property for Random Reshuffling
A without Replacement Matrix Bernstein's Inequality
High Probability Bounds on Stochastic Gradient Errors
High Probability First-Order Complexity Results
High Probability Sample Complexity
Stopping Criterion and Last Iterate Result
Random Reshuffling with Stopping Criterion
The Last Iterate Result
High Probability Second-Order Complexity Result
Algorithm Design and Our Result
Escaping Saddle Region by Perturbation
Descent Property During Escaping and Proof of Theorem \ref{theo:escape saddle}
...and 2 more sections

Key Result

Lemma 2.1

Let $\left\{X_1, \dots , X_{n} \right\}$ be a finite set of symmetric matrices. Suppose that the set is centered (i.e., $\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = 0$) and has uniformly bounded operator norm, $\left\lVert X_i \right\rVert_{\operatorname{op}} \le b$ for all $i$. Suppose fu… Here, $\lambda m/n$ is the largest eigenvalue of the matrix $V = \frac{m}{n}\sum_{i=1}^{n} X_i^2$ a…
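The lemma statement is cut off in this excerpt, so the sketch below does not reproduce its bound. It only illustrates, numerically, one qualitative fact behind concentration for sampling without replacement: for a centered set, the average over a full permutation is exactly zero, whereas the average of $n$ draws with replacement generally is not. The helper names `centered_symmetric` and `partial_sum_norms`, and the sizes used, are hypothetical choices for this illustration.

```python
import numpy as np

def centered_symmetric(n, d, seed=0):
    """Return n random symmetric d x d matrices whose average is zero."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, d, d))
    sym = (a + a.transpose(0, 2, 1)) / 2   # symmetrize each matrix
    return sym - sym.mean(axis=0)          # center the set

def partial_sum_norms(mats, m, trials, replace, seed=0):
    """Operator norms of the average of m sampled matrices, one per trial."""
    rng = np.random.default_rng(seed)
    n = len(mats)
    out = np.empty(trials)
    for t in range(trials):
        idx = rng.choice(n, size=m, replace=replace)
        # ord=2 on a 2-D array gives the largest singular value (spectral norm)
        out[t] = np.linalg.norm(mats[idx].mean(axis=0), ord=2)
    return out

mats = centered_symmetric(n=64, d=4)
without = partial_sum_norms(mats, m=64, trials=50, replace=False)
with_rep = partial_sum_norms(mats, m=64, trials=50, replace=True)
print("without replacement, m = n:", without.max())
print("with replacement,    m = n:", with_rep.mean())
```

Varying `m` between 1 and `n` in this script gives a feel for how the without-replacement deviation shrinks as the pass nears completion, which is the regime a Bernstein-type inequality for sampling without replacement is meant to quantify.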

Figures (5)

Figure 1: Flowchart of ${\sf p}\text{-}{\sf RR}$.
Figure 2: Verification of the Lipschitz gradient and Hessian conditions used in our theoretical developments.
Figure 3: Comparison of performance between ${\sf RR }$ and SGD.
Figure 4: Statistics of iterations / epochs $t$ of ${\sf RR }$ and SGD for achieving an $\varepsilon$-stationary point (i.e., $\|\nabla f(x_t)\|\leq \varepsilon$) with varying $\varepsilon$.
Figure 5: Evolution of $\|g_t\|$, $\|\nabla f(x_t)\|$, and test accuracy of ${\sf RR }$.

Theorems & Definitions (36)

Lemma 2.1: without replacement matrix Bernstein's inequality
proof
Proposition 2.3: concentration property of stochastic gradient errors
proof
Lemma 2.4: concentration property of stochastic error
proof
Lemma 3.1: approximate descent property
proof
Lemma 3.2
proof
...and 26 more
