Trading-off Accuracy and Communication Cost in Federated Learning
Mattia Jacopo Villani, Emanuele Natale, Frederik Mallmann-Trenn
TL;DR
This paper tackles the high communication cost of federated learning by introducing Zampling, a framework that replaces the full parameter vector $\vec{w}$ with a product $\vec{w}=Q\vec{p}$, where $Q\in\mathbb{R}^{m\times n}$ is a fixed sparse matrix and $\vec{p}\in[0,1]^n$ is trained. By sampling $\vec{z}\sim\mathrm{Bernoulli}(\vec{p})$ and forming $\vec{w}=Q\vec{z}$ during forward passes, clients can exchange only $n$ bits per parameter update, enabling up to $1024$-fold total compression relative to sending full-precision weights. The framework generalizes training-by-sampling beyond Zhou et al. and links the method to random convex geometry through zonotopes and zonoids, offering theoretical insights on initialization, the role of the degree $d$ of $Q$, and the benefits of federated averaging on the explored solution space. Empirically, Zampling achieves large compression with minimal accuracy loss on MNIST, the federated variant delivers substantial communication savings while maintaining performance, and Local Zampling shows improved generalization under parameter perturbations. These results suggest a practical path to scalable, privacy-preserving, communication-efficient federated learning with strong theoretical underpinnings in convex geometry.
Abstract
Leveraging the training-by-pruning paradigm introduced by Zhou et al., Isik et al. introduced a federated learning protocol that achieves a 34-fold reduction in communication cost. We achieve compression improvements of orders of magnitude over the state of the art. The central idea of our framework is to encode the network weights $\vec w$ by a vector of trainable parameters $\vec p$, such that $\vec w = Q\cdot \vec p$, where $Q$ is a carefully generated sparse random matrix that remains fixed throughout training. In this framework, the previous work of Zhou et al. [NeurIPS'19] is recovered when $Q$ is diagonal and $\vec p$ has the same dimension as $\vec w$. We show instead that $\vec p$ can be chosen much smaller than $\vec w$ while retaining the same accuracy, at the price of a decrease in the sparsity of $Q$. Since server and clients only need to share $\vec p$, this trade-off leads to a substantial improvement in communication cost. Moreover, we provide theoretical insight into our framework and establish a novel link between training-by-sampling and random convex geometry.
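
To make the encoding $\vec w = Q\cdot \vec p$ concrete, the following is a minimal sketch in plain Python and not the authors' implementation. Several details are assumptions chosen for illustration only: the helper names (`make_sparse_q`, `sample_mask`, `apply_q`), the construction of $Q$ with exactly `d` nonzero Gaussian entries per row, the toy sizes, and the 32-bit float baseline used for the bit count.

```python
import random


def make_sparse_q(m, n, d, rng):
    """Hypothetical construction of a fixed sparse m x n matrix Q.

    Assumption: each row holds exactly d nonzero entries at distinct random
    columns, with standard Gaussian values. Rows are stored as lists of
    (column, value) pairs.
    """
    return [
        [(col, rng.gauss(0.0, 1.0)) for col in rng.sample(range(n), d)]
        for _ in range(m)
    ]


def sample_mask(p, rng):
    """Draw z ~ Bernoulli(p) coordinate-wise; the result is a list of 0/1."""
    return [1 if rng.random() < p_j else 0 for p_j in p]


def apply_q(q, v):
    """Compute the sparse matrix-vector product Q v."""
    return [sum(val * v[col] for col, val in row) for row in q]


rng = random.Random(0)
m, n, d = 1000, 100, 5            # toy sizes: m weights, n trainable parameters

q = make_sparse_q(m, n, d, rng)   # fixed for the whole run
p = [rng.random() for _ in range(n)]  # trainable probabilities in [0, 1]

z = sample_mask(p, rng)           # binary sample used in a forward pass
w_sampled = apply_q(q, z)         # weights w = Q z
w_expected = apply_q(q, p)        # expected weights w = Q p

# Rough per-update communication, assuming a 32-bit float baseline.
bits_mask = n                     # one bit per entry of z
bits_full = 32 * m                # full-precision weight vector
compression = bits_full / bits_mask
```

With these toy sizes, sending the binary sample costs 100 bits against 32,000 bits for the full weight vector. Shrinking `n` relative to `m` raises the ratio, while raising `d` makes $Q$ denser, which mirrors the trade-off described in the abstract.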
