SparsyFed: Sparse Adaptive Federated Training
Adriano Guastella, Lorenzo Sani, Alex Iacob, Alessio Mora, Paolo Bellavista, Nicholas D. Lane
TL;DR
SparsyFed tackles the practical challenges of sparse training in cross-device FL by combining dynamic masking with a sparsity-inducing weight re-parameterization (Powerpropagation). It reaches up to $95\%$ sparsity with negligible accuracy loss and substantially reduces communication costs, helped by fast mask consensus across clients. The method is agnostic to the server optimizer and to client selection, and it requires only one additional hyperparameter (a hyperparameter-free variant uses a spectral exponent instead). Empirical results on CIFAR-10/100 and Speech Commands show higher accuracy than fixed-mask baselines together with uplink and downlink savings, making sparse FL more viable in heterogeneous, resource-constrained environments.
Abstract
Sparse training is often adopted in cross-device federated learning (FL) environments where constrained devices collaboratively train a machine learning model on private data by exchanging pseudo-gradients across heterogeneous networks. Although sparse training methods can reduce communication overhead and computational burden in FL, they are often not used in practice for the following key reasons: (1) data heterogeneity makes it harder for clients to reach consensus on sparse models compared to dense ones, requiring longer training; (2) methods for obtaining sparse masks lack adaptivity to accommodate very heterogeneous data distributions, crucial in cross-device FL; and (3) additional hyperparameters are required, which are notably challenging to tune in FL. This paper presents SparsyFed, a practical federated sparse training method that critically addresses the problems above. Previous works have only solved one or two of these challenges at the expense of introducing new trade-offs, such as clients' consensus on masks versus sparsity pattern adaptivity. We show that SparsyFed simultaneously (1) can produce 95% sparse models, with negligible degradation in accuracy, while only needing a single hyperparameter, (2) achieves a per-round weight regrowth 200 times smaller than previous methods, and (3) allows the sparse masks to adapt to highly heterogeneous data distributions and outperform all baselines under such conditions.
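To make the two ingredients named above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard Powerpropagation form $w = \theta \cdot |\theta|^{\beta - 1}$ with exponent $\beta \ge 1$, and it assumes that masks are obtained by keeping the largest-magnitude weights (top-k). The function names `powerprop`, `top_k_mask`, and `mask_regrowth`, as well as the exact definition of regrowth used here (the fraction of currently active weights that were inactive in the previous mask), are hypothetical and chosen only for illustration.

```python
def powerprop(theta, beta):
    """Powerpropagation re-parameterization: w = theta * |theta|**(beta - 1).

    beta = 1 recovers the ordinary weights; beta > 1 shrinks small
    parameters more than large ones, which favours sparse solutions.
    """
    if beta < 1:
        raise ValueError("beta is assumed to be >= 1")
    return [t * abs(t) ** (beta - 1) for t in theta]


def top_k_mask(weights, sparsity):
    """Binary mask that keeps the largest-magnitude (1 - sparsity) fraction."""
    keep = max(1, round(len(weights) * (1.0 - sparsity)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:keep])
    return [1 if i in kept else 0 for i in range(len(weights))]


def mask_regrowth(prev_mask, new_mask):
    """Fraction of active entries in new_mask that were inactive in prev_mask.

    This definition is an assumption made for illustration only.
    """
    active = sum(new_mask)
    if active == 0:
        return 0.0
    regrown = sum(1 for p, n in zip(prev_mask, new_mask) if p == 0 and n == 1)
    return regrown / active


# Toy example: four parameters, beta = 2, 50% target sparsity.
theta = [0.5, -2.0, 0.1, 3.0]
weights = powerprop(theta, beta=2.0)
mask = top_k_mask(weights, sparsity=0.5)
sparse_weights = [w * m for w, m in zip(weights, mask)]
```

In this toy setting the two smallest-magnitude weights are zeroed out, so only the nonzero entries and their positions would need to be transmitted. A low regrowth value between consecutive rounds indicates that the mask has stabilized, which is the sense in which mask consensus is discussed above.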
