The Entrapment Problem in Random Walk Decentralized Learning

Zonghong Liu; Salim El Rouayheb; Matthew Dwyer

TL;DR

The paper addresses decentralized learning on graphs with heterogeneous data, where naive importance sampling via Metropolis-Hastings (MH) can entrap the random walk at high-importance nodes, slowing convergence. It proposes Metropolis-Hastings with Lévy jumps (MHLJ), which perturbs the MH transitions so that the walk can escape entrapment, and provides a convergence-rate bound that separates the exploration effect (via the mixing time) from the jump-induced bias (an error gap). Theoretical results are complemented by simulations on ring, grid, and Watts-Strogatz networks, showing that MHLJ speeds up convergence in sparse, heterogeneous settings and that the error gap can be controlled by adjusting the jump probability $p_J$. Overall, the work offers a principled way to combine localized decentralized sampling with stochastic optimization while addressing exploration-exploitation trade-offs in networked data settings.

Abstract

This paper explores decentralized learning in a graph-based setting, where data is distributed across nodes. We investigate a decentralized SGD algorithm that utilizes a random walk to update a global model based on local data. Our focus is on designing the transition probability matrix to speed up convergence. While importance sampling can enhance centralized learning, its decentralized counterpart, using the Metropolis-Hastings (MH) algorithm, can lead to the entrapment problem, where the random walk becomes stuck at certain nodes, slowing convergence. To address this, we propose the Metropolis-Hastings with Lévy Jumps (MHLJ) algorithm, which incorporates random perturbations (jumps) to overcome entrapment. We theoretically establish the convergence rate and error gap of MHLJ and validate our findings through numerical experiments.
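The entrapment effect described above can be illustrated numerically. The following is a minimal sketch, not the paper's implementation: it builds a Metropolis-Hastings transition matrix on a five-node ring whose target distribution is proportional to hypothetical importance weights, then mixes it with a stand-in jump matrix. The weights `[100, 1, 1, 1, 1]`, the helper names, and the uniform-within-`r`-hops jump rule are all assumptions made for illustration; the paper's actual Lévy jump distribution is not reproduced here.

```python
import numpy as np

def ring_adjacency(n):
    """Adjacency matrix of an undirected ring with n nodes."""
    A = np.zeros((n, n), dtype=int)
    for v in range(n):
        A[v, (v + 1) % n] = 1
        A[v, (v - 1) % n] = 1
    return A

def metropolis_hastings(A, pi):
    """MH transition matrix on graph A with stationary distribution pi.

    Proposal: uniform over neighbours.
    Acceptance: min(1, pi[u] * deg[v] / (pi[v] * deg[u])).
    """
    n = len(pi)
    deg = A.sum(axis=1)
    P = np.zeros((n, n))
    for v in range(n):
        for u in np.flatnonzero(A[v]):
            accept = min(1.0, (pi[u] * deg[v]) / (pi[v] * deg[u]))
            P[v, u] = accept / deg[v]
        P[v, v] = 1.0 - P[v].sum()  # rejected proposals become self-loops
    return P

def uniform_jump_matrix(n, r):
    """Hypothetical stand-in for a jump matrix on a ring:
    jump uniformly to any other node within r hops (NOT the paper's Levy rule)."""
    J = np.zeros((n, n))
    for v in range(n):
        targets = {(v + k) % n for k in range(-r, r + 1)} - {v}
        for u in targets:
            J[v, u] = 1.0 / len(targets)
    return J

n = 5
weights = np.array([100.0, 1.0, 1.0, 1.0, 1.0])  # assumed importance weights
pi = weights / weights.sum()

P_is = metropolis_hastings(ring_adjacency(n), pi)

p_J = 0.1  # jump probability
P_mix = (1.0 - p_J) * P_is + p_J * uniform_jump_matrix(n, r=2)

print(round(P_is[0, 0], 3))   # 0.99: the walk almost never leaves node 0
print(round(P_mix[0, 0], 3))  # 0.891: jumps lower the self-loop probability
```

In this toy example the high-importance node keeps the walk with probability 0.99 per step under plain MH. Mixing in jumps reduces that self-loop probability, at the cost of changing the stationary distribution away from `pi`, which is the source of the bias the abstract calls the error gap.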

Paper Structure (19 sections, 5 theorems, 32 equations, 6 figures, 1 algorithm)

Introduction
Previous Work
Contributions
Organization
Problem Setting
Network and Objective Function
Data Heterogeneity
Random Walk Learning
Importance Sampling
Importance Sampling in Centralized Learning
Importance Sampling in Decentralized Learning
The Entrapment Problem
MHLJ Algorithm
Convergence Result
Proofs
...and 4 more sections

Key Result

Theorem 1

Suppose that each local loss function $f_v$ is $L_v$-smooth and $\mu$-strongly convex, and $\lVert\nabla f_v(x^*)\rVert^2\leq \sigma_*^2$ for all $v\in V$. Then for $\gamma<\min\left\{\frac{1}{\bar{L}},\ \frac{1}{T\mu}\ln\left(T\frac{\lVert x^0-x^*\rVert^2\mu^2}{\tau_{mix}\sigma_*^2\bar{L}}\right)\right\}$, the output of Algorithm MHLJ ... where $\tau_{mix}$ is the mixing time of $P=P_{IS}-p_{J}(P_{IS}-P_{\text{Lévy}})$, and ...
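
As a reading aid, the perturbed transition matrix in the theorem can be rearranged into a convex combination, which may make the role of $p_J$ easier to see. This is only an algebraic rewrite of the expression above, assuming $p_J\in[0,1]$ and that $P_{IS}$ and $P_{\text{Lévy}}$ are both row-stochastic:

$$P = P_{IS} - p_J\,(P_{IS} - P_{\text{Lévy}}) = (1-p_J)\,P_{IS} + p_J\,P_{\text{Lévy}}.$$

Under those assumptions $P$ is itself row-stochastic. Setting $p_J=0$ recovers $P_{IS}$, and $p_J=1$ gives $P_{\text{Lévy}}$ alone.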

Figures (6)

Figure 1: Decentralized learning via random walk. The model $x$ is carried by a random walk, which is represented by the red arrows. The model is updated using local data of the visited node in each iteration.
Figure 2: (a) An example of a ring topology with five nodes that may cause the entrapment issue. (b) The Markov chain representation of the random walk on the graph in (a).
Figure 3: Linear regression model $y = Ax + \epsilon$ trained on a synthetic heterogeneous data set over a ring network with 1000 nodes. We compare uniform sampling, importance sampling, and our MHLJ algorithm. The $y$-axis is the mean squared error (MSE), i.e., $\sum_{v\in V}\lVert y_v - A_v\hat{x}\rVert^2/|V|$. The $x$-axis is the number of iterations with SGD updates, i.e., the number of times the SGD update step is executed. We generate the data $A_v$ on node $v$ with $A_v\overset{\mathrm{i.i.d.}}{\sim} N(0,\sigma^2\mathbb{I}_{10})$, where $\sigma^2$ takes the value $1$ with probability $0.998$ and $100$ with probability $0.002$. The noise is generated from $\epsilon\overset{\mathrm{i.i.d.}}{\sim}N(0,1)$. We use the hyper-parameters $(p_J,p_d,r)=(0.1,0.5,3)$.
Figure 4: Regression model trained on a synthetic data set over an Erdős-Rényi (1000, 0.1) network with 1000 nodes. We compare uniform sampling with Metropolis-Hastings transition probability and importance sampling with Metropolis-Hastings transition probability. $\sigma^2_H=100$, $\sigma^2_L=1$. (a) Homogeneous data. (b) Heterogeneous data.
Figure 5: Regression model trained on a synthetic heterogeneous data set over sparse networks with 1000 nodes. We compare uniform sampling with Metropolis-Hastings transition probability, importance sampling with Metropolis-Hastings transition probability, and importance sampling with MHLJ. $\sigma^2_H=100$, $\sigma^2_L=1$. (a) 2-d grid. (b) Watts-Strogatz (1000, 4, 0.1) graph.
...and 1 more figure
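
The synthetic data described in the Figure 3 caption can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' script: it assumes one sample per node, a ground-truth vector `x_true` drawn from a standard normal, and a fixed seed, none of which are specified in the caption.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
n_nodes, dim = 1000, 10

# Per-node variance: 1 with probability 0.998, 100 with probability 0.002.
sigma2 = rng.choice([1.0, 100.0], size=n_nodes, p=[0.998, 0.002])

# A_v ~ N(0, sigma_v^2 * I_10): scale standard normal rows by sigma_v.
A = rng.normal(size=(n_nodes, dim)) * np.sqrt(sigma2)[:, None]

# Assumption: ground-truth model drawn from a standard normal.
x_true = rng.normal(size=dim)

# y_v = A_v x + eps_v with eps_v ~ N(0, 1).
y = A @ x_true + rng.normal(size=n_nodes)

def mse(x_hat):
    """Mean squared error over all nodes, as on the y-axis of Figure 3."""
    return np.sum((y - A @ x_hat) ** 2) / n_nodes

print(mse(np.zeros(dim)), mse(x_true))
```

With these settings only a handful of nodes are expected to hold high-variance data, which is the kind of heterogeneity that makes importance weights very uneven across the network.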

Theorems & Definitions (9)

Definition 1
Remark 1: Computation vs. Communication overheads of MHLJ
Theorem 1: Convergence of Algorithm MHLJ
Lemma 1
Lemma 2
Lemma 3
Lemma 4
Proof of Lemma 2
Proof of Theorem 1
