Vanishing Variance Problem in Fully Decentralized Neural-Network Systems

Yongding Tian; Zaid Al-Ars; Maksim Kitsak; Peter Hofstee

TL;DR

This work identifies a vanishing variance problem that arises when uncorrelated neural networks are averaged in fully decentralized gossip learning, and shows that it delays convergence. It proposes a variance-corrected model averaging method that preserves, layer by layer, the Xavier variances of the input models, allowing gossip learning to converge as efficiently as federated learning under both IID and non-IID data. The approach shows substantial gains over standard gossip learning and competing variants, with up to 6x faster convergence in large-scale networks. The results suggest that variance-aware aggregation can make privacy-preserving decentralized learning practical, with efficiency comparable to federated learning.

Abstract

Federated learning and gossip learning are emerging methodologies designed to mitigate data privacy concerns by retaining training data on client devices and exclusively sharing locally-trained machine learning (ML) models with others. The primary distinction between the two lies in their approach to model aggregation: federated learning employs a centralized parameter server, whereas gossip learning adopts a fully decentralized mechanism, enabling direct model exchanges among nodes. This decentralized nature often positions gossip learning as less efficient compared to federated learning. Both methodologies involve a critical step: computing a representation of received ML models and integrating this representation into the existing model. Conventionally, this representation is derived by averaging the received models, exemplified by the FedAVG algorithm. Our findings suggest that this averaging approach inherently introduces a potential delay in model convergence. We identify the underlying cause and refer to it as the "vanishing variance" problem, where averaging across uncorrelated ML models undermines the optimal variance established by the Xavier weight initialization. Unlike federated learning where the central server ensures model correlation, and unlike traditional gossip learning which circumvents this problem through model partitioning and sampling, our research introduces a variance-corrected model averaging algorithm. This novel algorithm preserves the optimal variance needed during model averaging, irrespective of network topology or non-IID data distributions. Our extensive simulation results demonstrate that our approach enables gossip learning to achieve convergence efficiency comparable to that of federated learning.
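
The abstract describes variance-corrected averaging only at a high level, so the following is a minimal sketch of one possible reading, not the authors' algorithm. It assumes each model is a dictionary from layer name to a flat list of weights, averages the models element-wise, and then rescales each averaged layer around its mean so that its variance equals the mean variance of the input layers. The function name `variance_corrected_average`, the choice of the mean input variance as the target, and the toy layer names and sizes are all assumptions made for illustration.

```python
import random
import statistics


def variance_corrected_average(models):
    """Average models layer by layer, then restore each layer's variance.

    models: list of dicts mapping layer name -> flat list of float weights.
    Assumption: the target variance of a layer is the mean of the input
    models' variances for that layer.
    """
    n = len(models)
    merged = {}
    for name in models[0]:
        layers = [m[name] for m in models]
        # Plain element-wise mean, as in conventional model averaging.
        avg = [sum(ws) / n for ws in zip(*layers)]
        target_var = sum(statistics.pvariance(layer) for layer in layers) / n
        avg_var = statistics.pvariance(avg)
        mean = statistics.fmean(avg)
        # Rescale deviations from the mean; keep the mean itself unchanged.
        scale = (target_var / avg_var) ** 0.5 if avg_var > 0 else 1.0
        merged[name] = [mean + scale * (w - mean) for w in avg]
    return merged


# Toy usage: four independently initialized two-layer "models".
rng = random.Random(0)
shapes = {"conv1": 500, "ip1": 2000}  # hypothetical layer sizes
models = [
    {name: [rng.gauss(0.0, 0.1) for _ in range(size)] for name, size in shapes.items()}
    for _ in range(4)
]
merged = variance_corrected_average(models)
```

Rescaling around the layer mean leaves the mean of the averaged weights untouched and changes only their spread, which is the quantity the abstract says plain averaging erodes.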

Paper Structure (18 sections, 1 theorem, 4 equations, 8 figures, 1 table, 3 algorithms)

This paper contains 18 sections, 1 theorem, 4 equations, 8 figures, 1 table, 3 algorithms.

Introduction
Background
Federated Learning
Gossip Learning
Xavier Initialization
Plateau Delay in Fully Decentralized Neural-Network Systems
Experiment Setup
Plateau Delay in Gossip Learning
Existing Methods without Plateau Delay
Federated Learning Variant
Weights-Compressed Gossip Learning
Transfer Learning Variant
Vanishing Variance Causes Plateau Delay
Variance-Corrected Model Averaging
Results
...and 3 more sections

Key Result

Proposition 1

Model averaging among uncorrelated neural network models leads to reduced model weights variance, which can impede subsequent model training efforts.
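
The variance reduction in Proposition 1 follows from a standard fact: the mean of N uncorrelated variables with variance sigma^2 has variance sigma^2 / N. The sketch below checks this numerically for Xavier-initialized layers. It is an illustration under assumed settings (a 100x100 layer, 8 models, Xavier normal initialization with variance 2 / (fan_in + fan_out)), not the paper's experiment; the helper names are hypothetical.

```python
import random
import statistics


def xavier_layer(fan_in, fan_out, rng):
    """Draw fan_in * fan_out weights from N(0, 2 / (fan_in + fan_out))."""
    std = (2.0 / (fan_in + fan_out)) ** 0.5
    return [rng.gauss(0.0, std) for _ in range(fan_in * fan_out)]


def average_models(models):
    """Element-wise mean of equally sized flat weight lists."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


rng = random.Random(0)
fan_in, fan_out, n_models = 100, 100, 8

# Independently initialized layers stand in for uncorrelated models.
models = [xavier_layer(fan_in, fan_out, rng) for _ in range(n_models)]

xavier_var = 2.0 / (fan_in + fan_out)
avg_var = statistics.pvariance(average_models(models))

# Ratio of averaged-layer variance to the Xavier variance; expected near 1 / n_models.
ratio = avg_var / xavier_var
print(f"variance ratio after averaging {n_models} models: {ratio:.3f}")
```

With 8 uncorrelated models, the averaged layer retains roughly one eighth of the variance that Xavier initialization set out to establish, which is the effect the proposition describes.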

Figures (8)

Figure 1: Label distribution under $\alpha=0.5$ for 8 nodes using the MNIST dataset, which contains 10 distinct labels. The y-axis represents the probability of each label occurring in a node's training set.
Figure 2: Accuracy curves and model weight differences for the baseline network under IID (a,b) and non-IID (c,d) settings. The "plateau delay" is calculated according to Equation \ref{equation:definition_plateau_delay}, and "90% accuracy" marks the point when over 90% of nodes achieve an accuracy higher than 0.9. "conv1", "conv2", "ip1" and "ip2" are the names of the four trainable layers in LeNet.
Figure 3: Accuracy curves for the baseline configuration with $Star(N=50)$ topology under IID (a) and non-IID (b) settings.
Figure 4: The accuracy curves for compression ratio=0.01, 0.2 and 0.6 in IID and non-IID settings.
Figure 5: Accuracy curves for the temporal hierarchical network under IID (a) and non-IID (b) settings.
...and 3 more figures

Theorems & Definitions (1)

Proposition 1
