High Probability Guarantees for Random Reshuffling
Hengxu Yu, Xiao Li
TL;DR
The paper studies random reshuffling (RR), a stochastic gradient method, for smooth nonconvex finite-sum optimization. It introduces a concentration framework for sampling without replacement and uses it to derive high-probability first- and second-order guarantees for RR. It also proposes RR with a computable stopping criterion (RR-sc) and a perturbed variant (p-RR) that escapes strict saddle points. The main results are a first-order complexity bound of $\tilde{O}(\max\{\sqrt{n}\varepsilon^{-3}, n\varepsilon^{-2}\})$ and a second-order bound of $\tilde{O}(\max\{\sqrt{n}\varepsilon^{-3}, n\varepsilon^{-5/2}\})$ under standard smoothness and Hessian-Lipschitz assumptions, all without requiring sub-Gaussian tail assumptions on the stochastic gradient errors. Numerical experiments on neural network training support the theory and illustrate the practical behavior of the stopping criterion and the concentration of the gradient errors. The concentration tool may be of independent interest for analyzing RR-type methods in other settings.
Abstract
We consider the stochastic gradient method with random reshuffling ($\mathsf{RR}$) for tackling smooth nonconvex optimization problems. $\mathsf{RR}$ finds broad applications in practice, notably in training neural networks. In this work, we provide high probability first-order and second-order complexity guarantees for this method. First, we establish a high probability first-order sample complexity result for driving the Euclidean norm of the gradient (without taking expectation) below $\varepsilon$. Our derived complexity matches the best existing in-expectation one up to a logarithmic term, without imposing additional assumptions or changing $\mathsf{RR}$'s updating rule. We then propose a simple and computable stopping criterion for $\mathsf{RR}$ (denoted by $\mathsf{RR}$-$\mathsf{sc}$). This criterion is guaranteed to be triggered after a finite number of iterations, enabling us to prove a high probability first-order complexity guarantee for the last iterate. Second, building on the proposed stopping criterion, we design a perturbed random reshuffling method ($\mathsf{p}$-$\mathsf{RR}$) that involves an additional randomized perturbation procedure near stationary points. We show that $\mathsf{p}$-$\mathsf{RR}$ provably escapes strict saddle points and establish a high probability second-order complexity result, without requiring any sub-Gaussian tail-type assumptions on the stochastic gradient errors. The fundamental ingredient in deriving the aforementioned results is a new concentration property for sampling without replacement in $\mathsf{RR}$, which could be of independent interest. Finally, we conduct numerical experiments on neural network training to support our theoretical findings.
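To make the setting concrete, the following is a minimal Python sketch of random reshuffling with an epoch-level stopping check. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not give the exact form of the $\mathsf{RR}$-$\mathsf{sc}$ criterion, so the check used here (the norm of the average of the stochastic gradients evaluated during one epoch, compared with a tolerance `eps`) is an assumed stand-in. The function name `rr_with_stopping`, its parameters, and the quadratic toy problem are all hypothetical. The perturbation step of $\mathsf{p}$-$\mathsf{RR}$ is not included.

```python
import math
import random


def rr_with_stopping(grads, x0, step, eps, max_epochs, seed=0):
    """Random reshuffling with an assumed epoch-level stopping check.

    grads: list of n component gradient functions, grads[i](x) -> list of floats.
    Returns (x, epochs_run, stopped).
    """
    rng = random.Random(seed)
    n = len(grads)
    x = list(x0)
    for epoch in range(1, max_epochs + 1):
        # Sampling without replacement: draw one fresh permutation per epoch.
        perm = list(range(n))
        rng.shuffle(perm)
        acc = [0.0] * len(x)  # sum of the stochastic gradients used this epoch
        for i in perm:
            g = grads[i](x)
            acc = [a + gj for a, gj in zip(acc, g)]
            x = [xj - step * gj for xj, gj in zip(x, g)]
        # Assumed criterion (not taken from the paper): the averaged epoch
        # gradient is computable from quantities RR already evaluates.
        avg_norm = math.sqrt(sum(a * a for a in acc)) / n
        if avg_norm <= eps:
            return x, epoch, True
    return x, max_epochs, False


# Toy finite sum f(x) = (1/n) * sum_i 0.5 * (x - a_i)^2, chosen only because
# its stationary point (the mean of a_i, here 1.5) is easy to check.
data = [0.0, 1.0, 2.0, 3.0]
grads = [lambda x, a=a: [x[0] - a] for a in data]
x, epochs, stopped = rr_with_stopping(
    grads, x0=[10.0], step=0.01, eps=0.05, max_epochs=500
)
print(x, epochs, stopped)
```

The check reuses gradients that the epoch computes anyway, which is one way a stopping rule can be "computable" without an extra full-gradient pass; whether this matches the paper's rule is not confirmed by the abstract.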
