Communication-Efficient Federated Learning via Regularized Sparse Random Networks
Mohamad Mestoukirdi, Omid Esrafilian, David Gesbert, Qianrui Li, Nicolas Gresset
TL;DR
This paper tackles communication bottlenecks in Federated Learning by training over-parameterized random networks and exchanging sparse binary masks (up to $1$ bit per parameter). It identifies that existing stochastic masking approaches do not reliably produce sparse sub-networks under consistent objectives and introduces a regularized loss that penalizes mask entropy to promote sparsity while preserving generalization. The authors formalize the objective $\bar{F}(\boldsymbol{m}) = \frac{1}{\sum_i |\mathcal{D}_i|} \sum_{k=1}^K |\mathcal{D}_k| \ell(y_{\boldsymbol{m}}, \mathcal{D}_k) + \frac{\lambda}{n} H(\boldsymbol{m})$ and define a local loss with a regularization term, enabling training with straight-through estimators and Bernoulli mask sampling. Experiments on MNIST, CIFAR-10, and CIFAR-100 under IID and non-IID conditions show substantial gains in communication and memory efficiency—up to about five orders of magnitude—while maintaining competitive validation accuracy, demonstrating practical impact for resource-constrained edge FL.
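The regularized objective above can be illustrated with a minimal sketch. It assumes, purely for illustration, that $H(\boldsymbol{m})$ is the empirical entropy of the mask, $n \cdot h(\hat{p})$, where $\hat{p}$ is the fraction of ones and $h$ is the binary entropy in bits; the summary does not specify the exact form. The function names, the per-client loss values, and the Bernoulli parameters `theta` are hypothetical, and the straight-through gradient step is not shown.

```python
import math
import random


def binary_entropy_bits(p):
    """Entropy in bits of a Bernoulli(p) variable; defined as 0 at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def sample_mask(theta, rng):
    """Draw one binary entry per parameter, m_j ~ Bernoulli(theta_j)."""
    return [1 if rng.random() < t else 0 for t in theta]


def mask_entropy(mask):
    """Assumed form of H(m): n * h(p_hat), with p_hat the fraction of ones."""
    n = len(mask)
    p_hat = sum(mask) / n
    return n * binary_entropy_bits(p_hat)


def regularized_objective(client_losses, client_sizes, mask, lam):
    """Dataset-size-weighted average of client losses plus (lam / n) * H(m)."""
    total = sum(client_sizes)
    data_term = sum(s * l for s, l in zip(client_sizes, client_losses)) / total
    return data_term + (lam / len(mask)) * mask_entropy(mask)


# Hypothetical usage: two clients with made-up losses and dataset sizes.
rng = random.Random(0)
theta = [0.05] * 1000            # illustrative Bernoulli parameters
mask = sample_mask(theta, rng)   # binary mask that would be transmitted
value = regularized_objective([0.4, 0.6], [100, 300], mask, lam=0.1)
```

Under this assumed form, a mask that is all zeros or all ones contributes no penalty, while a mask with half its entries set pays the full $\lambda$, so the penalty grows as the mask moves away from the sparse extreme.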
Abstract
This work presents a new method for enhancing communication efficiency in stochastic Federated Learning that trains over-parameterized random networks. In this setting, a binary mask is optimized instead of the model weights, which are kept fixed. The mask characterizes a sparse sub-network that is able to generalize as well as a smaller target network. Importantly, sparse binary masks are exchanged rather than the floating-point weights used in traditional federated learning, reducing the communication cost to at most 1 bit per parameter (Bpp). We show that previous state-of-the-art stochastic methods fail to find sparse networks that can reduce the communication and storage overhead using consistent loss objectives. To address this, we propose adding a regularization term to the local objectives that acts as a proxy for the entropy of the transmitted masks, thereby encouraging sparser solutions by eliminating redundant features across sub-networks. Extensive empirical experiments demonstrate significant improvements in communication and memory efficiency of up to five orders of magnitude compared to the literature, with minimal degradation in validation accuracy in some instances.
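To make the "at most 1 Bpp" figure concrete, the sketch below compares three costs per parameter: a 32-bit float weight, an uncompressed binary mask, and a sparse mask under an ideal entropy coder. It assumes independent mask entries with a common density, which is a simplification not stated in the abstract, and the function names are hypothetical.

```python
import math


def mask_bits_per_parameter(density):
    """Ideal entropy-coded cost (in bits) of one mask entry that is 1 with
    probability `density`, assuming independent, identically distributed entries."""
    if density <= 0.0 or density >= 1.0:
        return 0.0
    return (-density * math.log2(density)
            - (1.0 - density) * math.log2(1.0 - density))


def savings_vs_float32(density):
    """Ratio of a 32-bit float weight to the ideal mask cost per parameter."""
    return 32.0 / mask_bits_per_parameter(density)


FLOAT32_BPP = 32.0                            # conventional weight exchange
DENSE_MASK_BPP = mask_bits_per_parameter(0.5)  # 1.0: the 1 Bpp worst case
SPARSE_MASK_BPP = mask_bits_per_parameter(0.01)  # well below 1 Bpp
```

A mask with half of its entries set sits at the 1 Bpp ceiling, and the cost per parameter falls toward zero as the mask becomes sparser, which is why a regularizer that lowers mask entropy also lowers the transmitted payload.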
