Why Random Reshuffling Beats Stochastic Gradient Descent

Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo

TL;DR

This work analyzes Random Reshuffling (RR) for minimizing a finite-sum objective, showing that RR with suffix/Polyak-Ruppert averaging and a diminishing stepsize $α_k=Θ(1/k^s)$ ($s∈(1/2,1)$) achieves almost surely a suboptimality rate of $Θ(1/k^{2s})$, faster than SGD's $Ω(1/k)$. The key idea is to treat RR as gradient descent with cycle-dependent gradient errors, decoupling these errors into an independent $O(α_k)$ term and an $O(α_k^2)$ term to apply a law of large numbers on a weighted error sequence; high-probability bounds are also derived. The paper extends the results to smooth component functions with Lipschitz Hessians and introduces a bias-removal variant (De-biased RR, DRR) that can attain $O(1/k^2)$ suboptimality with high probability. A practical DRR algorithm is proposed, including a bias estimation step requiring a Hessian inversion, and experiments illustrate substantial performance gains over RR and SGD, clarifying why without-replacement sampling can outperform traditional SGD in large-scale finite-sum problems.

Abstract

We analyze the convergence rate of the random reshuffling (RR) method, which is a randomized first-order incremental algorithm for minimizing a finite sum of convex component functions. RR proceeds in cycles, picking a uniformly random order (permutation) and processing the component functions one at a time according to this order, i.e., at each cycle, each component function is sampled without replacement from the collection. Though RR has been numerically observed to outperform its with-replacement counterpart stochastic gradient descent (SGD), characterization of its convergence rate has been a long-standing open question. In this paper, we answer this question by showing that when the component functions are quadratics or smooth and the sum function is strongly convex, RR with iterate averaging and a diminishing stepsize $α_k=Θ(1/k^s)$ for $s\in (1/2,1)$ converges at rate $Θ(1/k^{2s})$ with probability one in the suboptimality of the objective value, thus improving upon the $Ω(1/k)$ rate of SGD. Our analysis draws on the theory of Polyak-Ruppert averaging and relies on decoupling the dependent cycle gradient error into an independent term over cycles and another term dominated by $α_k^2$. This allows us to apply the law of large numbers to an appropriately weighted version of the cycle gradient errors, where the weights depend on the stepsize. We also provide high-probability convergence rate estimates that show the decay rates of the different terms and allow us to propose a modification of RR with convergence rate ${\cal O}(\frac{1}{k^2})$.
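The cycle structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the scalar quadratics `A`, `B`, the constant `R`, the function name `run`, and the choice to average the cycle-end iterates are all assumptions made for illustration. SGD is run in blocks of $m$ with-replacement draws so that both methods use the same number of gradient evaluations per cycle.

```python
import random

# Hypothetical toy problem (not from the paper):
# f(x) = sum_i f_i(x) with scalar quadratics f_i(x) = 0.5*a_i*x^2 - b_i*x.
A = [1.0, 2.0, 3.0, 4.0]
B = [1.0, -2.0, 0.5, 3.0]
# Minimizer of the sum, from f'(x) = sum(A)*x - sum(B) = 0.
X_STAR = sum(B) / sum(A)


def run(method, cycles=2000, s=0.75, R=0.1, seed=0):
    """Return the average of the cycle-end iterates after `cycles` cycles.

    method: "rr" draws a fresh permutation each cycle (without replacement),
            "sgd" draws m indices independently (with replacement).
    The stepsize alpha_k = R / k**s is held fixed within cycle k.
    """
    rng = random.Random(seed)
    m = len(A)
    x = 0.0
    running_sum = 0.0
    for k in range(1, cycles + 1):
        alpha = R / k ** s
        if method == "rr":
            order = list(range(m))
            rng.shuffle(order)  # uniformly random permutation
        elif method == "sgd":
            order = [rng.randrange(m) for _ in range(m)]
        else:
            raise ValueError(f"unknown method: {method}")
        for i in order:
            # Incremental gradient step on component i.
            x -= alpha * (A[i] * x - B[i])
        running_sum += x  # accumulate cycle-end iterates for averaging
    return running_sum / cycles


if __name__ == "__main__":
    for method in ("rr", "sgd"):
        errs = [abs(run(method, seed=sd) - X_STAR) for sd in range(20)]
        print(method, "mean |x_bar - x*| =", sum(errs) / len(errs))
```

On this toy problem the averaged RR iterate would be expected to land noticeably closer to the minimizer than the averaged SGD iterate, in line with the rates quoted above, though a single small experiment is only an illustration and not evidence for the theorem.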

Paper Structure

This paper contains 14 sections, 13 theorems, 58 equations, 3 figures, 1 algorithm.

Key Result

Theorem 1

(Gur2015IncGrad) Let Assumption assump-sum-is-str-cvx hold. Let $f_i(x)$ be a quadratic function of the form $f_i(x) = \frac{1}{2} x^T P_i x - q_i^T x + r_i$, where $P_i$ is a symmetric $n\times n$ matrix, $q_i \in \mathbb R^n$ is a column vector and $r_i$ is a scalar for $i=1,2,\dots,m$. Suppose Ass where $c$ is the strong convexity constant of the sum function $f(x)$ and

Figures (3)

  • Figure 1: Left panel: Comparison of the histograms of the approximation error $\bar{x}_k - x^*$ of the averaged iterates for RR and SGD after $k=500$ cycles over $10000$ sample paths created for Example \ref{exam-one} with $s=0.75$. Each sample path contains $1000$ gradient computations for both RR and SGD. Right, top panel: Histogram of the scaled approximation error $k^s (\bar{x}_k - x^*)$ for the RR iterates, which is concentrated around the vertical red line. Right, bottom panel: Histogram of the scaled approximation error $k^{1/2} (\bar{x}_k - x^*)$ for SGD, which has the shape of a standard normal distribution. The vertical blue line passing through the origin is the axis of symmetry of this distribution, indicating that the distribution is centered.
  • Figure 2: Comparison of RR, De-biased RR (DRR) and SGD when the component functions are random quadratics with $m=50$, $n=20$ and a simulation time of 0.5 seconds over $500$ sample paths. Top, left: Histograms of $\hbox{dist}_k$ for RR, DRR and SGD. Bottom, left: Histograms of $\hbox{dist}_k$ for RR and DRR only (without SGD). Top, right: Histograms of the suboptimality in objective value for RR, DRR and SGD. Bottom, right: Histograms of the suboptimality in objective value for RR and DRR only (without SGD).
  • Figure 3: Comparison of RR, De-biased RR (DRR) and SGD. The simulation framework and parameters are the same as those in Fig. \ref{fig-3}, except that the simulation time is 5 seconds for each path.

Theorems & Definitions (26)

  • Example 3.2
  • Theorem 1
  • Corollary 4.1
  • Theorem 2
  • Remark 4.2
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • proof
  • ...and 16 more