Self-Regulating Random Walks for Resilient Decentralized Learning on Graphs

Maximilian Egger; Rawad Bitar; Ghadir Ayache; Antonia Wachter-Zeh; Salim El Rouayheb

TL;DR

This work addresses the resilience of decentralized learning based on random walks (RWs) on graphs when RWs can fail arbitrarily. It introduces two decentralized algorithms, DecAFork and DecAFork+, which use a distributed return-time estimator to keep the number of active RWs $Z_t$ close to a target $Z_0$; DecAFork+ additionally terminates RWs deliberately to curb overshoot. The authors provide theoretical guarantees, including asymptotic unbiasedness of the estimator, bounds on the reaction time and the overshoot, and finiteness of the number of RWs. Simulations across failure models (burst, probabilistic, and Byzantine) and graph sizes indicate that $Z_t$ can be held near $Z_0$ without a central coordinator, with DecAFork+ coping with the most adverse failure models.

Abstract

Consider the setting of multiple random walks (RWs) on a graph executing a certain computational task. For instance, in decentralized learning via RWs, a model is updated at each iteration based on the local data of the visited node and then passed to a randomly chosen neighbor. RWs can fail due to node or link failures. The goal is to maintain a desired number of RWs to ensure failure resilience. Achieving this is challenging due to the lack of a central entity that tracks which RWs have failed and replaces them with new ones by forking (duplicating) surviving ones. Without duplications, the number of RWs will eventually go to zero, causing a catastrophic failure of the system. We propose two decentralized algorithms called DecAFork and DecAFork+ that can maintain the number of RWs in the graph around a desired value even in the presence of arbitrary RW failures. Nodes continuously estimate the number of surviving RWs by estimating their return time distribution and fork the RWs when failures are likely to happen. DecAFork+ additionally allows terminations to avoid overloading the network by forking too many RWs. We present extensive numerical simulations that show the performance of DecAFork and DecAFork+ regarding fast detection and reaction to failures compared to a baseline, and establish theoretical guarantees on the performance of both algorithms.
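The claim that the number of RWs eventually reaches zero without duplication can be illustrated with a minimal sketch. It assumes the independent failure model in which every surviving RW fails with probability $p_f$ at each time step; the function name and the values `z0=10`, `p_f=0.01`, and `horizon=5000` are hypothetical choices for illustration, and the positions of the RWs on the graph are omitted because they do not affect the count under this failure model.

```python
import random


def simulate_without_forking(z0=10, p_f=0.01, horizon=5000, seed=0):
    """Count surviving RWs over time when failed RWs are never replaced.

    Assumption (illustrative): each surviving RW fails independently
    with probability p_f at every time step, and no forking happens.
    """
    rng = random.Random(seed)
    alive = z0
    history = [alive]
    for _ in range(horizon):
        # Keep only the RWs that do not fail in this time step.
        alive = sum(1 for _ in range(alive) if rng.random() >= p_f)
        history.append(alive)
    return history


history = simulate_without_forking()
print("initial RWs:", history[0], "final RWs:", history[-1])
```

Under this model the expected count is $Z_0 (1 - p_f)^t$, which decays geometrically; this is the behavior that forking is meant to counteract.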

Paper Structure

This paper contains 15 sections, 16 theorems, 27 equations, 6 figures, and 3 algorithms.

Introduction
System Model
Main Results: DecAFork & DecAFork+
MissingPerson - A Baseline Method
DecAFork: Robustness by Careful Forking
DecAFork+: Faster Reaction by Deliberate Termination
Analyzing DecAFork: Reaction vs. Overshoot
On the Average of the Estimator $\hat{\theta}_i(t)$
On the Distribution of the Estimator $\hat{\theta}_i(t)$
A Bound on the Reaction Time to Failure Events
The Number of Random Walks is Finite
Bounding the Overshoot after Failures
Bounding the Terminations in DecAFork+
Numerical Experiments
Conclusion

Key Result

Proposition 1

Under the assumption on the distributions (labeled assumption:distributions in the source), and with the empirical distribution of $R_{i}$ replaced by its analytical counterpart, the estimator $\hat{\theta}_i(t)$ satisfies $2\mathrm{E}[\hat{\theta}_i(t)] = Z_t$ for RWs that have been active infinitely long, i.e., with no forks or terminations.
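The estimator $\hat{\theta}_i(t)$ itself is not defined in this summary, so it is not reproduced here. As a hedged illustration of the quantity it builds on, and assuming $R_i$ denotes the return time of an RW to node $i$ (as the abstract suggests), the sketch below measures the empirical mean return time of a single simple random walk. For a connected $d$-regular graph the stationary distribution is uniform, so the mean return time equals $n$. The graph is a circulant $8$-regular graph, a deterministic stand-in for the random $8$-degree regular graphs used in the experiments; all function names and parameters are hypothetical.

```python
import random


def circulant_neighbors(n, half_degree):
    """Adjacency lists of a circulant graph with degree 2 * half_degree.

    Node v is linked to v + s (mod n) for every nonzero offset s with
    |s| <= half_degree. This is a stand-in for a random regular graph.
    """
    return [
        [(v + s) % n for s in range(-half_degree, half_degree + 1) if s != 0]
        for v in range(n)
    ]


def mean_return_time(neighbors, node=0, steps=500_000, seed=0):
    """Average gap between consecutive visits of one RW to `node`."""
    rng = random.Random(seed)
    position = node  # the walk starts at the observed node at time 0
    last_visit = 0
    gaps = []
    for t in range(1, steps + 1):
        # Pass the walk to a uniformly chosen neighbor.
        position = rng.choice(neighbors[position])
        if position == node:
            gaps.append(t - last_visit)
            last_visit = t
    return sum(gaps) / len(gaps)


n = 100
neighbors = circulant_neighbors(n, half_degree=4)
estimate = mean_return_time(neighbors)
print("empirical mean return time:", round(estimate, 1), "theory:", n)
```

A node that sees an RW return much later than this typical time has some evidence that the RW may have failed, which is the intuition behind estimating $Z_t$ from return times.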

Figures (6)

Figure 1: Performance of MissingPerson, DecAFork and DecAFork+ in maintaining the number of random walks (RWs) $Z_t$ around a desired value $Z_0=10$. $\mathcal{G}$ is a random $8$-degree regular graph with $n = 100$ nodes. We induce two burst failure events at $t=2000$ and $t=6000$ where multiple RWs fail simultaneously. MissingPerson over-reacts to the failure events by over-forking, whereas DecAFork ($\varepsilon=2$) reacts faster and only forks RWs until $Z_t$ stabilizes around $Z_0$. DecAFork+ ($\varepsilon=3.25, \varepsilon_2=5.75$) can react faster by terminating RWs if $Z_t$ exceeds $Z_0$. Standard deviations over $50$ simulation runs are depicted by shaded areas.
Figure 2: Performance of DecAFork and DecAFork+ for a random $8$-degree regular graph with $n = 100$ nodes. In addition to two burst failure events at $t=2000$ and $t=6000$, each RW can independently fail with probability $p_f$ at each time step. The parameters $\varepsilon$ and $\varepsilon_2$ are chosen to stabilize around $Z_0$ for $p_f=0$, i.e., as in Figure 1. DecAFork+ exhibits a stable performance for different values of $p_f$. DecAFork can successfully recover from burst failures but does not attain the target redundancy of $Z_0$ due to continuous probabilistic failures that outweigh the forks.
Figure 3: Performance of DecAFork and DecAFork+ for a random $8$-degree regular graph with $n = 100$ nodes. In addition to two burst failure events at $t=2000$ and $t=6000$, one dedicated node deterministically fails each incoming RW (Byz). The challenge is to cope with Byzantine nodes that can suddenly stop terminating RWs, i.e., start behaving honestly (No Byz). Only DecAFork+ can cope with this extreme failure model.
Figure 4: Consistent performance of DecAFork for random $8$-degree regular graphs with different numbers of nodes $n \in \{50, 100, 200\}$ and $Z_0=10$.
Figure 5: DecAFork on a random $8$-degree regular graph with $n=100$ nodes. Different choices for $\varepsilon$ show the trade-off between reaction time and undesired forks beyond $Z_0=10$.
...and 1 more figure

Theorems & Definitions (27)

Proposition 1
proof
Lemma 1
Corollary 1
Theorem 1
proof
Proposition 2
Lemma 2
proof
Proposition 3
...and 17 more
