Initialisation and Network Effects in Decentralised Federated Learning

Arash Badie-Modiri; Chiara Boldrini; Lorenzo Valerio; János Kertész; Márton Karsai

TL;DR

The paper addresses the strong dependence of fully decentralised federated learning on network topology and on the initial parameter values of the models. It proposes a topology-aware, uncoordinated initialisation based on the distribution of eigenvector centralities: the initial parameters are re-scaled using the steady-state norm $\|v_{\text{steady}}\|$ to counteract the parameter compression caused by repeated neighbourhood averaging. A simple numerical model and a Markov-chain analysis link the early dynamics to gossip-like aggregation and mixing times, yielding practical formulas and exact or approximate gain-correction strategies with which decentralised training approaches the efficiency of a centralised baseline given the same total data. Experiments across multiple datasets and architectures show robustness to topology and partial connectivity, and examine how network density, data per node, system size, and communication frequency affect scalability. The work offers design guidance for scalable, uncoordinated distributed training, highlights the role of network structure in learning dynamics, and acknowledges limitations and avenues for future extension.
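The compression effect and its correction can be illustrated with a small numerical sketch. It is not the authors' code and rests on assumptions that are not stated in this summary: aggregation is taken to be a uniform average over each node's closed neighbourhood (the node itself plus its neighbours), local training is ignored, and `hub_ring` is a hypothetical test topology. Under these assumptions the averaging matrix is a Markov chain, its left stationary vector plays the role of $v_{\text{steady}}$, and the standard deviation of each node's parameters shrinks by a factor close to $\|v_{\text{steady}}\|$.

```python
import numpy as np

def averaging_matrix(adj):
    """Row-stochastic matrix: each node averages uniformly over itself and
    its neighbours (assumed aggregation rule for this sketch)."""
    closed = adj + np.eye(adj.shape[0])
    return closed / closed.sum(axis=1, keepdims=True)

def steady_state(m, max_iter=10_000, tol=1e-13):
    """Left stationary vector of the row-stochastic matrix m (power iteration)."""
    v = np.full(m.shape[0], 1.0 / m.shape[0])
    for _ in range(max_iter):
        nxt = v @ m
        converged = np.abs(nxt - v).max() < tol
        v = nxt
        if converged:
            break
    return v / v.sum()

def hub_ring(n):
    """Hypothetical irregular topology: a ring of n nodes plus chords from
    node 0 to every fourth node."""
    adj = np.zeros((n, n))
    for i in range(n):
        adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
    for j in range(4, n - 1, 4):
        adj[0, j] = adj[j, 0] = 1.0
    return adj

def compression_ratio(adj, n_params=20_000, rounds=600, sigma=1.0, seed=0):
    """Ratio of per-node parameter std after and before repeated averaging."""
    rng = np.random.default_rng(seed)
    m = averaging_matrix(adj)
    # Rows are parameters, columns are nodes; every entry is drawn independently.
    w = rng.normal(0.0, sigma, size=(n_params, adj.shape[0]))
    std_before = w.std(axis=0).mean()
    for _ in range(rounds):
        w = w @ m.T  # each node replaces its parameters with the neighbourhood mean
    return w.std(axis=0).mean() / std_before

adj = hub_ring(16)
v_steady = steady_state(averaging_matrix(adj))
gain = 1.0 / np.linalg.norm(v_steady)  # factor by which the initial std would be scaled up
print("||v_steady|| =", np.linalg.norm(v_steady))
print("measured compression =", compression_ratio(adj))
print("gain correction =", gain)
```

In this toy model, drawing the initial parameters with standard deviation multiplied by `gain` leaves each node with roughly the intended standard deviation once the averaging has settled. For a regular or complete network with $n$ nodes the factor reduces to $\sqrt{n}$.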

Abstract

Fully decentralised federated learning enables collaborative training of individual machine learning models on a distributed network of communicating devices while keeping the training data localised on each node. This approach avoids central coordination, enhances data privacy and eliminates the risk of a single point of failure. Our research highlights that the effectiveness of decentralised federated learning is significantly influenced by the network topology of connected devices and the learning models' initial conditions. We propose a strategy for uncoordinated initialisation of the artificial neural networks based on the distribution of eigenvector centralities of the underlying communication network, leading to a radically improved training efficiency. Additionally, our study explores the scaling behaviour and the choice of environmental parameters under our proposed initialisation strategy. This work paves the way for more efficient and scalable artificial neural network training in a distributed and uncoordinated environment, offering a deeper understanding of the intertwining roles of network structure and learning dynamics.

Paper Structure (21 sections, 4 equations, 10 figures, 2 tables, 1 algorithm)

This paper contains 21 sections, 4 equations, 10 figures, 2 tables, 1 algorithm.

Introduction
Motivation
Contribution
Related works
Preliminaries
System model and notation
Experimental setup
Uncoordinated initialisation of artificial neural networks
Motivating example
Numerical model of early-stage dynamics
The compression of node parameters
Estimating parameter scaling factor $\|v_{\text{steady}}\|$
Initial stabilisation time
Scalability and role of learning parameters
Network density
...and 6 more sections

Figures (10)

Figure 1: Illustration of matrix $W$, $w_{*,i}$, $w_{j,*}$, $\sigma_{ap}$ and $\sigma_{an}$. Each column of the matrix $W$ represents parameters of a single node, and each row represents the values of the same parameter on different nodes. $\sigma_{an}$ and $\sigma_{ap}$ can therefore be defined as the mean of the standard deviations of rows and columns of $W$, respectively.
Figure 2: A comparison between (a) a typical centralised federated learning setup where nodes communicate only through a central server, (b) a typical decentralised but coordinated federated learning setup where nodes communicate directly with their peers, but a central server still plays a role in coordinating the setup and (c) a fully uncoordinated decentralised federated learning setup, where no coordination through a centralised server is necessary. Of particular interest to this work is the initialisation of learning model parameters, displayed using colours in this schematic. In centralised and coordinated cases, the coordinating server can ensure every node receives the same set of initial parameters, shown here using the colour of each node, while in the fully uncoordinated setting, we cannot make such assumptions.
Figure 3: Mean test cross-entropy loss with the proposed initialisation (solid lines) compared to the initialisation method of He et al. (2015) without re-scaling (dashed lines). The decentralised federated learning process is run on nodes connected through (a,b) fully-connected (complete) networks, with the MNIST classification task on a simple multilayer perceptron and iid data distribution; (c,d) Barabási--Albert networks with average degree 4, with the So2Sat LCZ42 classification task, a simple convolutional architecture and a Zipf data distribution with $\alpha=1.8$; (e,f) random 4-regular networks, with the CIFAR-10 classification task and the VGG16 architecture; and (g,h) the same configuration as (a,b) but using the Adam optimiser with decoupled weight decay. Without the proposed re-scaling of the parameters, the mean test loss shows a plateau whose duration in rounds grows linearly or sub-linearly with the system size. The bottom row (b,d,f,h) shows the empirical scaling with system size of the test-loss trajectory under the independent initialisation of He et al. (2015), with exponents ranging from 0.4 to 1. Error bars represent 95% confidence intervals.
Figure 4: Mean cross-entropy test loss as a function of communication rounds for a fully connected communication network with $n=64$ nodes, using the MNIST dataset with 512 items per node. Each (a) connection or (b) node is active at each round with probability $p$. Inactive nodes still perform local training but are momentarily isolated from the network. The proposed initialisation is shown with solid lines and the independent initialisation method of He et al. (2015) with dashed lines. Even at fairly low values of $p$, the system as a whole follows a much better learning trajectory with the proposed initialisation than with that of He et al. (2015). Error bars represent 95% confidence intervals.
Figure 5: (a) Mean magnitude of the change in parameters due to training and to aggregation separately, the total change, and the mean cosine similarity between the changes due to training and aggregation. In the early rounds of the iterative process, the change due to aggregation is several orders of magnitude larger than that due to training. The cosine similarity trajectory also indicates that these vectors are orthogonal in the early rounds, supporting the assumption of the numerical model that the early evolution of the system is dominated by the aggregation step. The evolution of $\sigma_{an}$ and $\sigma_{ap}$ in (b) the distributed learning process with actual ANNs and (c) the simplified numerical model shows similar early-stage dynamics. Values were calculated by (a,b) running or (c) numerically modelling the decentralised federated learning process on random 32-regular networks with $n=256$. The runs in panels (a,b) used 80 training samples per node and 1 epoch per communication round. Error bars represent 95% confidence intervals.
...and 5 more figures
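The definitions of $\sigma_{an}$ and $\sigma_{ap}$ in the Figure 1 caption can be sketched in a few lines. This is an illustrative reading of the caption, not the authors' implementation: the function names are hypothetical, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption.

```python
import numpy as np

def sigma_an(w):
    """Mean over rows of the row-wise std: the spread of each parameter across nodes."""
    return w.std(axis=1).mean()

def sigma_ap(w):
    """Mean over columns of the column-wise std: the spread of parameters within each node."""
    return w.std(axis=0).mean()

# W has one row per parameter and one column per node, as in Figure 1.
# Here all three nodes hold the same four parameters, so the nodes fully agree.
w = np.tile(np.arange(4.0)[:, None], (1, 3))
print("sigma_an:", sigma_an(w))
print("sigma_ap:", sigma_ap(w))
```

When every node holds identical parameters, as in this toy matrix, $\sigma_{an}$ is zero while $\sigma_{ap}$ still reflects the spread of parameter values within a node.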
