Accelerating Distributed Stochastic Optimization via Self-Repellent Random Walks

Jie Hu, Vishwaraj Doshi, Do Young Eun

TL;DR

This work tackles distributed stochastic optimization where gradient information is accessed through a token performing a nonlinear SRRW on a graph. By embedding the SRRW kernel into stochastic approximation (SA-SRRW), the authors derive a CLT-based analysis showing that the asymptotic covariance of optimization iterates is strictly reduced compared to base Markov-chain driven SA, with a main gain scaling as $O(1/\alpha^2)$ for the favorable two-timescale regime. They provide explicit covariance forms, including $\mathbf{V}_{\mathbf{x}}(\alpha)$ and Lyapunov-based expressions for the parameter covariance, and demonstrate that SRRW yields meaningful variance reduction across multiple datasets and tasks, while in certain regimes (e.g., same-timescale or particular limits) the gain may vanish. Empirical results on real graphs corroborate the theory, showing faster convergence and reduced variance as $\alpha$ increases, validating the practicality of SRRW-driven token algorithms for scalable decentralized learning.

Abstract

We study a family of distributed stochastic optimization algorithms where gradients are sampled by a token traversing a network of agents in random-walk fashion. Typically, these random walks are chosen to be Markov chains that asymptotically sample from a desired target distribution, and they play a critical role in the convergence of the optimization iterates. In this paper, we take a novel approach by replacing the standard linear Markovian token with one that follows a nonlinear Markov chain - namely the Self-Repellent Random Walk (SRRW). Defined for any given 'base' Markov chain, the SRRW, parameterized by a positive scalar α, is less likely to transition to states that were highly visited in the past, hence the name. In the context of MCMC sampling on a graph, a recent breakthrough in Doshi et al. (2023) shows that the SRRW achieves O(1/α) decrease in the asymptotic variance for sampling. We propose the use of a 'generalized' version of the SRRW to drive token algorithms for distributed stochastic optimization in the form of stochastic approximation, termed SA-SRRW. We prove that the optimization iterate errors of the resulting SA-SRRW converge to zero almost surely and prove a central limit theorem, deriving the explicit form of the resulting asymptotic covariance matrix corresponding to iterate errors. This asymptotic covariance is always smaller than that of an algorithm driven by the base Markov chain and decreases at rate O(1/α^2) - the performance benefit of using SRRW is thereby amplified in the stochastic optimization context. Empirical results support our theoretical findings.
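The self-repellent mechanism described in the abstract can be illustrated with a minimal sketch. It assumes the kernel form $K_{ij}[\mathbf{x}] \propto P_{ij}\,(x_j/\mu_j)^{-\alpha}$, which is not stated on this page and is taken here as an assumption about the SRRW of Doshi et al. (2023). The function names, the complete-graph base chain, and the one-fictitious-visit initialization are all hypothetical choices made for illustration, not the authors' implementation.

```python
import random

def srrw_kernel_row(P, mu, x, i, alpha):
    """Row i of an assumed SRRW-style kernel: K_ij proportional to
    P_ij * (x_j / mu_j) ** (-alpha). Nodes visited more often than the
    target mu prescribes (x_j > mu_j) are down-weighted."""
    weights = [P[i][j] * (x[j] / mu[j]) ** (-alpha) for j in range(len(mu))]
    total = sum(weights)
    return [w / total for w in weights]

def run_srrw(P, mu, alpha, steps, seed=0):
    """Simulate the token and return its empirical visit distribution.
    Assumption: every node starts with one fictitious visit so that the
    empirical measure x stays strictly positive."""
    rng = random.Random(seed)
    n = len(mu)
    counts = [1.0] * n
    state = 0
    for _ in range(steps):
        total = sum(counts)
        x = [c / total for c in counts]          # current empirical measure
        row = srrw_kernel_row(P, mu, x, state, alpha)
        state = rng.choices(range(n), weights=row)[0]
        counts[state] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical base chain: simple random walk on the complete graph K4,
# whose stationary distribution is uniform.
N = 4
P = [[0.0 if i == j else 1.0 / (N - 1) for j in range(N)] for i in range(N)]
mu = [1.0 / N] * N

empirical = run_srrw(P, mu, alpha=5.0, steps=5000)
```

Setting `alpha=0` makes every weight equal to `P[i][j]`, so the sketch reduces to the base Markov chain; larger `alpha` penalizes over-visited nodes more strongly.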

Paper Structure (31 sections, 24 theorems, 211 equations, 7 figures)


Key Result

Lemma 3.1

Under Assumptions assump:1, assump:3 and assump:4, for the SRRW iterates eqn:SRRW_iteration, we have […]. Moreover, for all $\alpha_2>\alpha_1>0$, we have ${\mathbf{V}}_{\mathbf{x}}(\alpha_2) <_L {\mathbf{V}}_{\mathbf{x}}(\alpha_1) <_L {\mathbf{V}}_{\mathbf{x}}(0)$.
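To unpack the ordering in the lemma, the following assumes $<_L$ is the standard Loewner order on symmetric matrices (the page itself does not define the symbol):

$$\mathbf{A} <_L \mathbf{B} \iff \mathbf{B} - \mathbf{A} \text{ is positive definite} \iff \mathbf{v}^\top \mathbf{A}\,\mathbf{v} < \mathbf{v}^\top \mathbf{B}\,\mathbf{v} \quad \text{for all } \mathbf{v} \neq \mathbf{0}.$$

Under that reading, the chain ${\mathbf{V}}_{\mathbf{x}}(\alpha_2) <_L {\mathbf{V}}_{\mathbf{x}}(\alpha_1) <_L {\mathbf{V}}_{\mathbf{x}}(0)$ says that for every nonzero direction $\mathbf{v}$,

$$\mathbf{v}^\top {\mathbf{V}}_{\mathbf{x}}(\alpha_2)\,\mathbf{v} < \mathbf{v}^\top {\mathbf{V}}_{\mathbf{x}}(\alpha_1)\,\mathbf{v} < \mathbf{v}^\top {\mathbf{V}}_{\mathbf{x}}(0)\,\mathbf{v},$$

so the asymptotic variance of every linear functional of the SRRW empirical measure strictly decreases as $\alpha$ grows, with $\alpha = 0$ corresponding to the base Markov chain.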

Figures (7)

  • Figure 1: Visualization of token algorithms using SRRW versus traditional MC in distributed learning. Our CLT analysis, extended from SRRW itself to distributed stochastic approximation, leads to near-zero variance for the SA iteration ${\bm{\theta}}_n$. Node numbers on the left denote visit counts.
  • Figure 2: Simulation results under case (i): (a) and (b) show the performance of SGD-SRRW and SHB-SRRW for various $\alpha$ values. (c) shows that MSE decreases at $O(1/\alpha^2)$ speed.
  • Figure 3: Comparison of the performance among cases (i) - (iii) for $\alpha \in \{1,5,10\}$.
  • Figure 4: Simulation results with various $\alpha$ values in a9a and splice datasets.
  • Figure 5: Performance comparison among cases (i) - (iii) for $\alpha \in \{5,10,20\}$ in a9a and splice datasets.
  • ...and 2 more figures

Theorems & Definitions (30)

  • Lemma 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Proposition 3.4
  • Corollary 3.5
  • Remark E.1
  • Lemma E.1
  • proof
  • Lemma E.2
  • proof
  • ...and 20 more