Distributed Random Reshuffling Methods with Improved Convergence

Kun Huang, Linli Zhou, Shi Pu

TL;DR

This work tackles decentralized optimization over networks with random reshuffling (RR) updates by introducing GT-RR and ED-RR, which combine RR with gradient tracking and exact diffusion, respectively, to reduce the adverse effect of poor network connectivity. The authors develop a unified analytical framework with Lyapunov-based proofs to show that these methods achieve convergence rates matching those of centralized RR up to network-dependent constants, both for general smooth nonconvex objectives and under the Polyak-Łojasiewicz (PL) condition. Key results include an $O\left( (1-\lambda)^{-1/3} m^{-1/3} T^{-2/3} \right)$ rate for nonconvex objectives and an $O\left( (1-\lambda)^{-1} m^{-1} T^{-2} \right)$ rate under PL, where $T$ is the number of epochs, $m$ is the sample size per agent, and $1-\lambda$ is the spectral gap of the mixing matrix, along with improved bounds for D-RR. Numerical experiments on distributed estimation and neural network training corroborate the theoretical gains: GT-RR and ED-RR outperform both unshuffled methods and previous RR methods, especially on better-connected graphs. Overall, the paper provides RR-based distributed optimization methods with convergence guarantees that hold under realistic network conditions.
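
For readers unfamiliar with RR, the following minimal sketch shows what a single-agent RR epoch looks like: the agent draws a fresh permutation of its local samples and takes one gradient step per sample, so every sample is used exactly once per epoch. This is only the local sampling pattern, not GT-RR or ED-RR themselves; the least-squares objective, the helper name `rr_epoch`, the stepsize, and the problem sizes are all illustrative assumptions.

```python
import numpy as np

def rr_epoch(x, A, b, stepsize, rng):
    """One random-reshuffling epoch on f(x) = (1/2m) * sum_i (a_i^T x - b_i)^2.

    Hypothetical helper for illustration; not taken from the paper.
    """
    m = A.shape[0]
    for i in rng.permutation(m):            # fresh permutation, each sample used once
        grad_i = (A[i] @ x - b[i]) * A[i]   # gradient of the i-th component function
        x = x - stepsize * grad_i
    return x

rng = np.random.default_rng(0)
m, d = 50, 5                                # assumed sample size and dimension
A = rng.standard_normal((m, d))
x_true = rng.standard_normal(d)
b = A @ x_true                              # consistent system, so x_true is a minimizer

x = np.zeros(d)
initial_error = np.linalg.norm(x - x_true)
for epoch in range(200):
    x = rr_epoch(x, A, b, stepsize=0.01, rng=rng)
final_error = np.linalg.norm(x - x_true)
```

In the distributed setting described here, each agent would run such a sweep over its own local samples while also exchanging information with its neighbors; that communication step is deliberately left out of this sketch.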

Abstract

This paper proposes two distributed random reshuffling methods, namely Gradient Tracking with Random Reshuffling (GT-RR) and Exact Diffusion with Random Reshuffling (ED-RR), to solve the distributed optimization problem over a connected network, where a set of agents aims to minimize the average of their local cost functions. Both algorithms invoke a random reshuffling (RR) update at each agent, inherit the favorable characteristics of RR for minimizing smooth nonconvex objective functions, and improve on the performance of previous distributed random reshuffling methods both theoretically and empirically. Specifically, both GT-RR and ED-RR achieve a convergence rate of $O(1/[(1-\lambda)^{1/3}m^{1/3}T^{2/3}])$ in driving the (minimum) expected squared norm of the gradient to zero, where $T$ denotes the number of epochs, $m$ is the sample size for each agent, and $1-\lambda$ represents the spectral gap of the mixing matrix. When the objective functions further satisfy the Polyak-Łojasiewicz (PL) condition, we show that GT-RR and ED-RR both achieve an $O(1/[(1-\lambda)mT^2])$ convergence rate in terms of the averaged expected differences between the agents' function values and the global minimum value. Notably, both results are comparable to the convergence rates of centralized RR methods (up to constant factors depending on the network topology) and outperform those of previous distributed random reshuffling algorithms.
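
To get a feel for how the two stated rates depend on $T$, $m$, and $\lambda$, the short sketch below evaluates the expressions inside the $O(\cdot)$ with all hidden constants set to 1. The function names are hypothetical and the returned numbers are scaling indicators only, not actual error bounds.

```python
def nonconvex_rate(lam, m, T):
    """Value of 1 / [(1-lam)^(1/3) * m^(1/3) * T^(2/3)], constants ignored."""
    return 1.0 / ((1.0 - lam) ** (1.0 / 3.0) * m ** (1.0 / 3.0) * T ** (2.0 / 3.0))

def pl_rate(lam, m, T):
    """Value of 1 / [(1-lam) * m * T^2], constants ignored."""
    return 1.0 / ((1.0 - lam) * m * T ** 2)

# Doubling the number of epochs T shrinks the nonconvex expression by 2^(-2/3)
# and the PL expression by 2^(-2) = 1/4.
ratio_nonconvex = nonconvex_rate(0.9, 100, 200) / nonconvex_rate(0.9, 100, 100)
ratio_pl = pl_rate(0.9, 100, 200) / pl_rate(0.9, 100, 100)
```

The same functions show the sensitivity to connectivity: shrinking the spectral gap $1-\lambda$ by a factor of 8 inflates the nonconvex expression by only a factor of 2, whereas the PL expression grows by the full factor of 8.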

Paper Structure (29 sections, 19 theorems, 77 equations, 4 figures, 2 tables, 3 algorithms)

Key Result

Lemma 1

Let Assumptions \ref{as:graph} and \ref{as:abc} hold. Then both GT-RR and ED-RR can be written as [...], where $\mathbf{s}_t^\ell := B(\mathbf{z}_t^\ell - B\mathbf{x}_t^\ell) + \alpha_t A\nabla F(\mathbf{1}(\bar{x}_{t}^{0})^{\intercal})$ for epochs $t = 0, 1, 2, \ldots, T-1$. In particular, letting $A = W$, $B = I-W$, $C = W$, and $\mathbf{z}_t^0 = -W\mathbf{x}_t^0$ recovers the update of GT-RR. Letting [...]

Figures (4)

  • Figure 1: Illustration of two graph topologies. The spectral gap $(1-\lambda)$ increases from the ring graph to the grid graph.
  • Figure 2: Comparison among ED-RR, GT-RR, D-RR, ED, DSGT, SGD, and centralized RR for solving Problem \ref{eq:ls}. The stepsizes are set as $1\times 10^{-5}$ for both graphs.
  • Figure 3: Comparison among ED-RR, GT-RR, D-RR, ED, DSGT, SGD, and centralized RR for solving Problem \ref{eq:ls}. The stepsizes are set as $1/(500t + 500)$ for both graphs.
  • Figure 4: Comparison among ED-RR, GT-RR, D-RR, ED, DSGT, SGD, and centralized RR for training a neural network on the MNIST dataset using constant stepsizes. The stepsizes are sequentially set as $1 / 2$, $1 / 10$, $1 / 50$, and $1 / 250$.
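
The claim in Figure 1 that the spectral gap grows from the ring graph to the grid graph can be checked numerically. The sketch below makes several assumptions that the text above does not state: 16 agents, a 4-by-4 grid without wrap-around, Metropolis weights for the mixing matrix $W$, and $\lambda$ taken as the second-largest eigenvalue magnitude of the symmetric matrix $W$. All function names are hypothetical.

```python
import numpy as np

def ring_adjacency(n):
    """0/1 adjacency matrix of a ring with n nodes."""
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1
    return adj

def grid_adjacency(rows, cols):
    """0/1 adjacency matrix of a rows-by-cols grid without wrap-around."""
    adj = np.zeros((rows * cols, rows * cols), dtype=int)
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:                 # edge to the right neighbor
                adj[i, i + 1] = adj[i + 1, i] = 1
            if r + 1 < rows:                 # edge to the neighbor below
                adj[i, i + cols] = adj[i + cols, i] = 1
    return adj

def metropolis_weights(adj):
    """Symmetric doubly stochastic mixing matrix (assumed weight rule)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()           # self-weight makes the row sum to 1
    return W

def spectral_gap(W):
    """1 - lambda, with lambda the second-largest eigenvalue magnitude of W."""
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(W)))
    return 1.0 - magnitudes[-2]

W_ring = metropolis_weights(ring_adjacency(16))
W_grid = metropolis_weights(grid_adjacency(4, 4))
gap_ring = spectral_gap(W_ring)
gap_grid = spectral_gap(W_grid)
```

Under these assumptions `gap_grid` is larger than `gap_ring`, in line with the ordering described in the caption; other weight rules or network sizes would change the exact values.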

Theorems & Definitions (44)

  • Lemma 1
  • proof
  • Remark 1
  • Remark 2
  • Lemma 2
  • proof
  • Remark 3
  • Remark 4
  • Lemma 3
  • Lemma 4
  • ...and 34 more