High Probability Guarantees for Random Reshuffling

Hengxu Yu, Xiao Li

TL;DR

The paper addresses nonconvex finite-sum optimization with stochastic gradient methods by focusing on random reshuffling (RR). It introduces a concentration framework for sampling without replacement and derives high-probability first- and second-order guarantees for RR, including a computable stopping criterion RR-sc and a perturbed variant p-RR that escapes strict saddles. The main contributions are a first-order complexity bound of $\tilde{O}(\max\{\sqrt{n}\varepsilon^{-3}, n\varepsilon^{-2}\})$ and a second-order bound of $\tilde{O}(\max\{\sqrt{n}\varepsilon^{-3}, n\varepsilon^{-5/2}\})$ under standard smoothness and Hessian Lipschitz assumptions, all without sub-Gaussian gradient error requirements. Numerical experiments on neural network training support the theory and illustrate practical stopping behavior and gradient concentration; the concentration tool may be of independent interest for RR-type analyses in other settings.
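
To read the two bounds, it can help to check which term inside each maximum dominates. The following is a direct algebraic comparison of the stated terms (ignoring the logarithmic factors and constants hidden in $\tilde{O}$), not a result quoted from the paper:

$$
\sqrt{n}\,\varepsilon^{-3} \ge n\,\varepsilon^{-2} \iff \varepsilon^{-1} \ge \sqrt{n} \iff \varepsilon \le n^{-1/2},
$$

$$
\sqrt{n}\,\varepsilon^{-3} \ge n\,\varepsilon^{-5/2} \iff \varepsilon^{-1/2} \ge \sqrt{n} \iff \varepsilon \le n^{-1}.
$$

So the $\sqrt{n}\varepsilon^{-3}$ term governs the first-order bound once $\varepsilon \le n^{-1/2}$, and governs the second-order bound only in the smaller range $\varepsilon \le n^{-1}$.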

Abstract

We consider the stochastic gradient method with random reshuffling ($\mathsf{RR}$) for tackling smooth nonconvex optimization problems. $\mathsf{RR}$ finds broad applications in practice, notably in training neural networks. In this work, we provide high probability first-order and second-order complexity guarantees for this method. First, we establish a high probability first-order sample complexity result for driving the Euclidean norm of the gradient (without taking expectation) below $\varepsilon$. Our derived complexity matches the best existing in-expectation one up to a logarithmic term while imposing no additional assumptions nor changing $\mathsf{RR}$'s updating rule. We then propose a simple and computable stopping criterion for $\mathsf{RR}$ (denoted as $\mathsf{RR}$-$\mathsf{sc}$). This criterion is guaranteed to be triggered after a finite number of iterations, enabling us to prove a high probability first-order complexity guarantee for the last iterate. Second, building on the proposed stopping criterion, we design a perturbed random reshuffling method ($\mathsf{p}$-$\mathsf{RR}$) that involves an additional randomized perturbation procedure near stationary points. We derive that $\mathsf{p}$-$\mathsf{RR}$ provably escapes strict saddle points and establish a high probability second-order complexity result, without requiring any sub-Gaussian tail-type assumptions on the stochastic gradient errors. The fundamental ingredient in deriving the aforementioned results is the new concentration property for sampling without replacement in $\mathsf{RR}$, which could be of independent interest. Finally, we conduct numerical experiments on neural network training to support our theoretical findings.
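
As a rough illustration of the update rule discussed in the abstract, the sketch below runs $\mathsf{RR}$ on a toy finite-sum problem: each epoch draws a fresh permutation and visits every component gradient exactly once. Several pieces are assumptions made for the example and do not come from the paper: the least-squares objective (which is convex, unlike the nonconvex setting studied), the constant step size, the helper names `random_reshuffling`, `grad_i`, and `full_grad`, and the stopping check on the exact gradient norm, which stands in for the paper's $\mathsf{RR}$-$\mathsf{sc}$ criterion, whose definition is not given in this summary.

```python
import numpy as np

def random_reshuffling(grad_i, full_grad, x0, n, step, eps, max_epochs, seed=0):
    """Run RR: every epoch visits all n component gradients once, in a fresh random order."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for epoch in range(max_epochs):
        # Illustrative stopping check on the exact gradient norm.
        # This is an assumption, NOT the paper's RR-sc criterion.
        if np.linalg.norm(full_grad(x)) <= eps:
            return x, epoch
        for i in rng.permutation(n):  # sampling without replacement
            x -= step * grad_i(x, i)
    return x, max_epochs

# Toy problem (hypothetical): f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2
data_rng = np.random.default_rng(1)
n, d = 50, 5
A = data_rng.normal(size=(n, d))
b = A @ data_rng.normal(size=d)  # consistent system: all component gradients vanish at the solution

def grad_i(x, i):
    """Gradient of the i-th component 0.5 * (a_i^T x - b_i)^2."""
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    """Gradient of the averaged objective f."""
    return A.T @ (A @ x - b) / n

x_out, epochs_used = random_reshuffling(
    grad_i, full_grad, np.zeros(d), n, step=0.01, eps=1e-3, max_epochs=2000
)
print("epochs used:", epochs_used)
print("final gradient norm:", np.linalg.norm(full_grad(x_out)))
```

Checking the exact gradient costs a full pass over the data per epoch, which is acceptable in a toy example; a computable criterion that avoids this extra pass is precisely what the abstract describes as the motivation for $\mathsf{RR}$-$\mathsf{sc}$.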

Paper Structure (17 sections, 18 theorems, 88 equations, 5 figures, 3 algorithms)

Key Result

Lemma 2.1

Let $\left\{X_1, \dots , X_{n} \right\}$ be a finite set of symmetric matrices. Suppose that the set is centered (i.e., $\overline{X} = \sum_{i=1}^{n} X_i /n=0$) and has uniformly bounded operator norm, $\left\lVert X_i \right\rVert_{\operatorname{op}} \le b$ for all $i$. Suppose fu… Here, $\lambda m/n$ is the largest eigenvalue of the matrix $V = \frac{m}{n}\sum_{i=1}^{n} X_i^2$ a…
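
The setting of this lemma (a centered, bounded set of symmetric matrices sampled without replacement) can be explored numerically. The sketch below is only an empirical illustration, not the lemma's bound: it compares the operator norm of a partial sum of $m$ matrices drawn without replacement against the same quantity drawn with replacement. The random Gaussian test matrices, the sizes `n` and `d`, the trial count, and the helper name `partial_sum_norms` are all assumptions chosen for the example.

```python
import numpy as np

def partial_sum_norms(X, m, trials, replace, rng):
    """Operator norms of sums of m matrices drawn from X, one value per trial."""
    n = X.shape[0]
    norms = np.empty(trials)
    for t in range(trials):
        idx = rng.choice(n, size=m, replace=replace)
        # ord=2 on a 2-D array gives the largest singular value (operator norm)
        norms[t] = np.linalg.norm(X[idx].sum(axis=0), ord=2)
    return norms

rng = np.random.default_rng(0)
n, d = 64, 4
G = rng.normal(size=(n, d, d))
X = (G + G.transpose(0, 2, 1)) / 2  # make each matrix symmetric
X -= X.mean(axis=0)                 # center the set so its average is zero

for m in (16, 32, 48, 64):
    wo = partial_sum_norms(X, m, 200, replace=False, rng=rng).mean()
    wr = partial_sum_norms(X, m, 200, replace=True, rng=rng).mean()
    print(f"m={m:2d}  without replacement: {wo:10.4e}   with replacement: {wr:10.4e}")
```

Because the set is centered, drawing all $n$ matrices without replacement gives a sum that is zero up to floating-point error, whereas sampling with replacement has no such cancellation. This is one intuitive reason a concentration inequality tailored to sampling without replacement can be sharper near the end of an epoch.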

Figures (5)

  • Figure 1: Flowchart of ${\sf p}\text{-}{\sf RR}$.
  • Figure 2: Verification of the Lipschitz gradient and Hessian conditions used in our theoretical developments.
  • Figure 3: Comparison of performance between ${\sf RR}$ and SGD.
  • Figure 4: Statistics of iterations / epochs $t$ of ${\sf RR}$ and SGD for achieving an $\varepsilon$-stationary point (i.e., $\|\nabla f(x_t)\|\leq \varepsilon$) with varying $\varepsilon$.
  • Figure 5: Evolution of $\|g_t\|$, $\|\nabla f(x_t)\|$, and test accuracy of ${\sf RR}$.

Theorems & Definitions (36)

  • Lemma 2.1: without replacement matrix Bernstein's inequality
  • proof
  • Proposition 2.3: concentration property of stochastic gradient errors
  • proof
  • Lemma 2.4: concentration property of stochastic error
  • proof
  • Lemma 3.1: approximate descent property
  • proof
  • Lemma 3.2
  • proof
  • ...and 26 more