Trading-off Accuracy and Communication Cost in Federated Learning

Mattia Jacopo Villani, Emanuele Natale, Frederik Mallmann-Trenn

TL;DR

This paper tackles the challenge of high communication cost in federated learning by introducing Zampling, a framework that replaces the full parameter vector $\vec{w}$ with a product $\vec{w}=Q\vec{p}$ where $Q\in\mathbb{R}^{m\times n}$ is a fixed sparse matrix and $\vec{p}\in[0,1]^n$ is trained. By exploiting stochastic sampling $\vec{z}\sim\mathrm{Bernoulli}(\vec{p})$ to form $\vec{w}=Q\vec{z}$ during forward passes, clients can exchange only $n$ bits per parameter update, enabling up to $1024$-fold total compression relative to sending full-precision weights. The framework generalizes training-by-sampling beyond Zhou et al., and links the method to random convex geometry through zonotopes and zonoids, offering theoretical insights on initialization, the role of the degree $d$ of $Q$, and the benefits of federated averaging on the explored solution space. Empirically, Zampling achieves large compression with minimal accuracy loss on MNIST, and the federated variant demonstrates substantial communication savings with maintained performance, while Local Zampling shows improved generalisation under parameter perturbations. These results suggest a practical path to scalable, privacy-preserving, communication-efficient federated learning with strong theoretical underpinnings in convex geometry.

Abstract

Leveraging the training-by-pruning paradigm introduced by Zhou et al., Isik et al. introduced a federated learning protocol that achieves a 34-fold reduction in communication cost. We achieve compression improvements of orders of magnitude over the state of the art. The central idea of our framework is to encode the network weights $\vec w$ by a vector of trainable parameters $\vec p$, such that $\vec w = Q\cdot \vec p$, where $Q$ is a carefully generated sparse random matrix that remains fixed throughout training. In this framework, the previous work of Zhou et al. [NeurIPS'19] is retrieved when $Q$ is diagonal and $\vec p$ has the same dimension as $\vec w$. We instead show that $\vec p$ can effectively be chosen much smaller than $\vec w$ while retaining the same accuracy, at the price of a less sparse $Q$. Since server and clients only need to share $\vec p$, this trade-off leads to a substantial improvement in communication cost. Moreover, we provide theoretical insight into our framework and establish a novel link between training-by-sampling and random convex geometry.
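The encoding $\vec w = Q\vec z$ with $\vec z\sim\mathrm{Bernoulli}(\vec p)$ can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the helper names `make_sparse_q` and `sample_weights`, the reading of the degree $d$ as the number of nonzero entries per row of $Q$, the standard-normal nonzero entries, the toy sizes, and the 32-bit baseline for full-precision weights.

```python
import numpy as np


def make_sparse_q(m, n, d, rng):
    """Build a fixed m x n random matrix with d nonzero entries per row.

    Hypothetical construction: interpreting d as the row degree and drawing
    the nonzero entries from N(0, 1) are both illustrative assumptions.
    """
    q = np.zeros((m, n))
    for i in range(m):
        cols = rng.choice(n, size=d, replace=False)  # d distinct columns
        q[i, cols] = rng.normal(0.0, 1.0, size=d)
    return q


def sample_weights(q, p, rng):
    """Sample a binary mask z ~ Bernoulli(p) and return (Q z, z)."""
    z = (rng.random(p.shape) < p).astype(q.dtype)
    return q @ z, z


rng = np.random.default_rng(0)
m, n, d = 1024, 32, 4             # toy sizes, not the paper's configuration
q = make_sparse_q(m, n, d, rng)   # fixed for the whole of training
p = rng.random(n)                 # trainable probabilities in [0, 1]
w, z = sample_weights(q, p, rng)  # weights used in one forward pass

# Communication cost per update under the stated assumptions:
bits_full = 32 * m                # m weights sent as 32-bit floats (assumed)
bits_mask = n                     # one bit per entry of z
compression = bits_full / bits_mask
print(f"w has {w.size} entries; the mask costs {bits_mask} bits; "
      f"ratio = {compression:.0f}x")
```

In this sketch the compression ratio is simply $32m/n$, so shrinking $n$ relative to $m$ is what buys the communication savings. Setting $n = m$ and making `q` diagonal gives the special case attributed to Zhou et al. above.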

Paper Structure

This paper contains 36 sections, 6 theorems, 14 equations, 6 figures, 3 tables.

Key Result

Lemma 2.1

Let the nonzero entries of the influence matrix $Q$ be distributed as: […]. Let $p_j \sim \text{s-dist}[0,1]$, $j = 1, \dots, n$, be independent and identically distributed (i.i.d.), where $\text{s-dist}[0,1]$ is a symmetric distribution with support in $[0, 1]$. Define the vector $\vec{w} = Q\vec{p}$, where each component $w_i$ is given by $w_i = \sum_{j = 1}^n p_j q_{i,j}$. Then, […], which simplifies to […] (Kaiming).

Figures (6)

  • Figure 1: An illustration of the Federated Zampling algorithm.
  • Figure 2: We assume $\mathbb{R}^n=\mathbb{R}^m$. (a) Hypercube realised by the possible values of the vector $\vec{p}$; (b) hyperrectangular zonotope generated by a diagonal influence matrix, as in Zhou et al.; (c) polytopal zonotope generated by our choice of influence matrix; (d) zonoid of the zonotope generated from the matrix $Q$, which yields an ellipsoid.
  • Figure 3: Trade-off between compression and accuracy for the small architecture in Local Zampling, for varying levels of $d$.
  • Figure 4: Results of training Federated Zampling in the federated learning framework with varying levels of $d$.
  • Figure 5: Impact of training-by-sampling. If we only train $\vec{p}$ directly and then sample a network at the end, the sampled network is not robust. However, selecting an initialization with abundant extreme values decreases the integrality gap.
  • ...and 1 more figure

Theorems & Definitions (8)

  • Lemma 2.1
  • Lemma 2.2
  • Lemma 2.3
  • Proposition 2.4
  • Definition 2.1: Random Zonotope
  • Proposition 2.5
  • Definition 2.2: $\tau$-Hypercube
  • Proposition 2.6: Benefits of Federated Learning