SparsyFed: Sparse Adaptive Federated Training

Adriano Guastella, Lorenzo Sani, Alex Iacob, Alessio Mora, Paolo Bellavista, Nicholas D. Lane

TL;DR

SparsyFed tackles the practical challenges of sparse training in cross-device FL by combining a dynamic masking approach with a sparsity-inducing weight re-parameterization (Powerpropagation). It achieves up to $95\%$ sparsity with negligible accuracy loss and reduces communication costs, aided by fast mask consensus across clients. The method is agnostic to server optimizers and client selection and requires only one additional hyperparameter (a hyperparameter-free variant based on a spectral exponent is also available). Empirical results on CIFAR-10/100 and Speech Commands show higher accuracy than fixed-mask baselines together with substantial uplink/downlink savings, making sparse FL more viable in heterogeneous, resource-constrained environments.
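As a rough illustration of the re-parameterization named above, the sketch below assumes the usual Powerpropagation form $w = v\,|v|^{\beta-1}$ with exponent $\beta \geq 1$, where $\beta$ plays the role of the single hyperparameter. The function names `powerprop` and `grad_wrt_v` are hypothetical, and the code is plain Python for readability rather than the authors' implementation.

```python
def powerprop(v, beta):
    """Map raw parameters v to effective weights w = v * |v|**(beta - 1).

    Assumes beta >= 1; beta == 1 leaves the weights unchanged.
    """
    return [x * abs(x) ** (beta - 1.0) for x in v]


def grad_wrt_v(v, grad_w, beta):
    """Chain rule for the assumed form: dL/dv = dL/dw * beta * |v|**(beta - 1)."""
    return [g * beta * abs(x) ** (beta - 1.0) for x, g in zip(v, grad_w)]


# Hypothetical raw parameters and a uniform upstream gradient.
raw = [2.0, -0.5, 0.1]
effective = powerprop(raw, beta=2.0)
scaled_grads = grad_wrt_v(raw, [1.0, 1.0, 1.0], beta=2.0)
```

Under this assumed form, the gradient reaching each raw parameter is scaled by its own magnitude, so small parameters receive small updates and tend to stay near zero. That is the sense in which the re-parameterization is sparsity-inducing, and it is why a later magnitude-based pruning step removes comparatively little information.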

Abstract

Sparse training is often adopted in cross-device federated learning (FL) environments where constrained devices collaboratively train a machine learning model on private data by exchanging pseudo-gradients across heterogeneous networks. Although sparse training methods can reduce communication overhead and computational burden in FL, they are often not used in practice for the following key reasons: (1) data heterogeneity makes it harder for clients to reach consensus on sparse models compared to dense ones, requiring longer training; (2) methods for obtaining sparse masks lack adaptivity to accommodate very heterogeneous data distributions, crucial in cross-device FL; and (3) additional hyperparameters are required, which are notably challenging to tune in FL. This paper presents SparsyFed, a practical federated sparse training method that critically addresses the problems above. Previous works have only solved one or two of these challenges at the expense of introducing new trade-offs, such as clients' consensus on masks versus sparsity pattern adaptivity. We show that SparsyFed simultaneously (1) can produce 95% sparse models, with negligible degradation in accuracy, while only needing a single hyperparameter, (2) achieves a per-round weight regrowth 200 times smaller than previous methods, and (3) allows the sparse masks to adapt to highly heterogeneous data distributions and outperform all baselines under such conditions.
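To make the consensus and regrowth problem concrete, the following minimal sketch prunes two hypothetical client updates with magnitude-based Top-K and then averages them. Plain averaging stands in for the server step here, which is an assumption; the helper names and the toy values are illustrative and not taken from the paper.

```python
def top_k_prune(update, sparsity):
    """Zero all but the largest-magnitude entries so that a `sparsity` fraction is zero."""
    keep = len(update) - int(round(sparsity * len(update)))
    ranked = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    kept = set(ranked[:keep])
    return [u if i in kept else 0.0 for i, u in enumerate(update)]


def density(vec):
    """Fraction of non-zero entries."""
    return sum(1 for x in vec if x != 0.0) / len(vec)


def average(updates):
    """Element-wise mean of client updates (assumed stand-in for the server optimizer)."""
    n = len(updates)
    return [sum(col) / n for col in zip(*updates)]


# Two hypothetical clients whose largest entries sit at different positions.
client_a = [0.9, 0.0, 0.1, -0.8, 0.05, 0.0, 0.02, 0.0, 0.01, 0.0]
client_b = [0.0, 0.7, 0.0, 0.1, 0.0, -0.6, 0.03, 0.0, 0.0, 0.02]

pruned_a = top_k_prune(client_a, sparsity=0.8)
pruned_b = top_k_prune(client_b, sparsity=0.8)
aggregated = average([pruned_a, pruned_b])

# Each pruned update has density 0.2, but the aggregate has density 0.4
# because the two clients kept disjoint positions.
print(density(pruned_a), density(pruned_b), density(aggregated))
```

When clients agree on which positions to keep, the aggregate stays at the target density; when they disagree, the server-side model regains density, which is the regrowth the abstract refers to.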

Paper Structure

This paper contains 52 sections, 8 equations, 13 figures, 8 tables, 2 algorithms.

Figures (13)

  • Figure 1: SparsyFed pipeline. (1) The server broadcasts the global model $\omega_{t}$. (2) Client $i$ re-parameterizes its local weights. (3) It executes a forward pass on batch $\mathcal{B}$. (4a) It computes the layer-wise sparsity $s_t$. (4b) It prunes the activations using $s_t$ and stores them. (5) It computes the gradients. (6) It applies the gradients. (7) It computes the model update and applies Top-K pruning. (8) It sends the sparse update $\Delta \tilde{\omega}_{i}^{t}$ back to the server. (9) The server applies its optimizer to obtain the global model. Steps (2-6) repeat until convergence.
  • Figure 2: (left) Accuracy versus communication cost for four implementations (ZeroFL, Top-K, FLASH, and SparsyFed), with the dense model as a reference. The test is conducted on CIFAR-100 partitioned with LDA($\alpha = 0.1$) at $95\%$ sparsity. SparsyFed outperforms the baselines, achieving high accuracy while communicating less. (right) Sparsity level of the global model, measured on the server after aggregating local updates (CIFAR-100, $\alpha = 0.1$). The density gain reflects mismatches between client updates, which cause the aggregated model to regain density; this can degrade performance and increase downlink communication. Note: FLASH maintains the target sparsity after the first round with a fixed mask.
  • Figure 3: Intersection over Union (IoU) of global model binary masks between training rounds for SparsyFed, Top-K, and ZeroFL (CIFAR-100, $\alpha = 0.1$, $95\%$ target sparsity). The IoU is calculated between each mask and the masks of all other rounds to show how the mask changes over time. The x and y axes are training round indices; the diagonal is the identity. IoU values close to 1.0 signify strong similarity between masks, while lower values indicate significant changes. SparsyFed shows consistent mask movement with minimal variation, suggesting strong consensus on weight usage among clients. ZeroFL struggles to find mask consensus, with masks continuing to shift even in later rounds. Note: FLASH is absent since its global mask is fixed.
  • Figure 4: We report the test accuracy of different re-parameterization methods with sparse activations during backpropagation. We deployed a ResNet-18 trained on the CIFAR-10 dataset using LDA$(\alpha=1)$. This plot illustrates the methods' performance under different sparsity levels. Powerpropagation exhibited superior robustness to the applied sparsity levels, achieving the best overall performance among these methods.
  • Figure 5: Test Accuracy with different $\beta$ values, with $95\%$ sparsity on CIFAR-10 LDA $\alpha=1.0$. The accuracy of the dense model (gray), the hyperparameter-free Spectral Exponent version, and the Top-K method are also reported for reference.
  • ...and 8 more figures
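
The mask comparison described in the Figure 3 caption can be sketched as below. This is a minimal example that assumes a binary mask is simply the non-zero pattern of the flattened global weights; the helper names and the toy per-round weights are hypothetical.

```python
def mask_of(weights):
    """Binary mask of a flattened weight vector: True where the weight is non-zero."""
    return [w != 0.0 for w in weights]


def mask_iou(mask_a, mask_b):
    """Intersection over Union of two equal-length binary masks."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 1.0  # two empty masks count as identical


def iou_matrix(masks):
    """Pairwise IoU between the masks of all rounds; the diagonal is 1.0."""
    return [[mask_iou(a, b) for b in masks] for a in masks]


# Hypothetical global weights after three rounds.
rounds = [
    [0.5, -0.2, 0.0, 0.0],
    [0.4, 0.0, 0.3, 0.0],
    [0.6, 0.0, 0.1, 0.0],
]
matrix = iou_matrix([mask_of(w) for w in rounds])
```

In this toy case the masks of the second and third rounds coincide, so their IoU is 1.0, while the first round shares only one of three active positions with either of them. A heatmap of such a matrix over many rounds gives the kind of view the caption describes.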