Adaptive Compression in Federated Learning via Side Information

Berivan Isik, Francesco Pase, Deniz Gunduz, Sanmi Koyejo, Tsachy Weissman, Michele Zorzi

TL;DR

This work introduces KL Minimization with Side Information (KLMS) to drastically reduce communication costs in Federated Learning by leveraging a server-side global distribution $p_{\theta}$ that closely matches client distributions in KL divergence. KLMS communicates samples via indices into $K$ samples drawn from $p_{\theta}$, with $K$ selected so that $K \approx \exp(D_{KL}(q_{\phi}||p_{\theta})+r)$, enabling per-client bitrate roughly equal to the KL divergence. The framework is adaptable to multiple stochastic FL paradigms (e.g., FedPM, QSGD, SignSGD, SGLD), and is enhanced by adaptive block allocation to balance bitrate across model coordinates and rounds. Empirical results on MNIST, EMNIST, CIFAR-10, and CIFAR-100 show substantial gains, including up to 82× bitrate reduction per framework and up to 2,650× overall compression with minimal or no loss in accuracy. The approach demonstrates that exploiting naturally available side information can set new standards for communication efficiency in FL and can be integrated without altering core training dynamics.
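
The index-based communication described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes product-Bernoulli distributions (as in mask-based schemes such as FedPM), a seed shared by client and server, and that $K$ is known to both sides. The names `klms_encode`, `klms_decode`, `seed`, and `rng_client` are hypothetical.

```python
import numpy as np

def klms_encode(q, p, seed, r=1.0, rng_client=None):
    """Client side: choose an index into K shared candidates drawn from p."""
    q = np.asarray(q, dtype=float)  # client Bernoulli parameters, strictly in (0, 1)
    p = np.asarray(p, dtype=float)  # global Bernoulli parameters, strictly in (0, 1)
    # KL divergence in nats between the product-Bernoulli distributions q and p.
    kl = np.sum(q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p)))
    K = int(np.ceil(np.exp(kl + r)))  # K ~ exp(D_KL + r)
    # Shared randomness: the server can regenerate these candidates from `seed`.
    shared = np.random.default_rng(seed)
    candidates = shared.random((K, p.size)) < p  # K samples from p
    # Log importance weights log q(y)/p(y) for each candidate.
    log_w = np.where(candidates,
                     np.log(q / p),
                     np.log((1 - q) / (1 - p))).sum(axis=1)
    pi = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    pi /= pi.sum()
    if rng_client is None:
        rng_client = np.random.default_rng()
    k = int(rng_client.choice(K, p=pi))  # private client randomness
    return k, K

def klms_decode(k, K, p, seed):
    """Server side: rebuild the same candidates and return the k-th one."""
    p = np.asarray(p, dtype=float)
    shared = np.random.default_rng(seed)
    candidates = shared.random((K, p.size)) < p
    return candidates[k]

# Usage: the only payload is the index k, which costs about log2(K) bits.
q = np.full(8, 0.6)
p = np.full(8, 0.5)
k, K = klms_encode(q, p, seed=0)
sample = klms_decode(k, K, p, seed=0)
print(f"K={K}, index bits={np.log2(K):.2f}, sample={sample.astype(int)}")
```

In this sketch the client never transmits the sample itself: the server recovers it from the shared seed and the index, so the payload shrinks as $q_{\phi}$ gets closer to $p_{\theta}$.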

Abstract

The high communication cost of sending model updates from the clients to the server is a significant bottleneck for scalable federated learning (FL). Among existing approaches, state-of-the-art bitrate-accuracy tradeoffs have been achieved using stochastic compression methods, in which client $n$ sends a sample from a client-only probability distribution $q_{\phi^{(n)}}$, and the server estimates the mean of the clients' distributions using these samples. However, such methods do not take full advantage of the FL setup, where the server has, throughout the training process, side information in the form of a global distribution $p_{\theta}$ that is close to the clients' distributions $q_{\phi^{(n)}}$ in Kullback-Leibler (KL) divergence. In this work, we exploit this closeness between the clients' distributions $q_{\phi^{(n)}}$ and the side information $p_{\theta}$ at the server, and propose a framework that requires approximately $D_{KL}(q_{\phi^{(n)}} \| p_{\theta})$ bits of communication. We show that our method can be integrated into many existing stochastic compression frameworks to attain the same (and often higher) test accuracy with up to 82 times smaller bitrate than prior work, corresponding to 2,650 times overall compression.
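
As a rough worked example of the "approximately KL-divergence bits" claim: if a client sends only an index into $K$ shared candidates with $K \approx \exp(D_{KL} + r)$, as stated in the TL;DR, the cost follows directly. The sketch assumes the KL divergence is measured in nats, and the numbers are illustrative only.

```latex
\begin{align*}
\text{bits per client}
  &= \log_2 K \\
  &\approx \log_2 \exp\!\big(D_{KL}(q_{\phi^{(n)}} \| p_{\theta}) + r\big) \\
  &= \frac{D_{KL}(q_{\phi^{(n)}} \| p_{\theta}) + r}{\ln 2}.
\end{align*}
% Illustrative numbers: D_KL = 2 nats and r = 1 give
% K \approx e^{3} \approx 20.1 and \log_2 K \approx 3 / 0.693 \approx 4.33 bits.
```

The cost is therefore the KL divergence converted from nats to bits, plus an overhead of $r / \ln 2$ bits controlled by $r$.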

Paper Structure (33 sections, 3 theorems, 28 equations, 6 figures, 15 tables, 12 algorithms)

Key Result

Theorem 4.1

Let $p_{\theta}$ and $q_{\phi^{(n)}}$ for $n=1, \dots, N$ be probability distributions over a set $\mathcal{X}$ equipped with some sigma-algebra. Let $X^{(n)}$ be an $\mathcal{X}$-valued random variable with law $q_{\phi^{(n)}}$. Let $r \geq 0$ and $\tilde{q}_{\pi^{(n)}}$ for $n=1, \dots, N$ be discre[…] Defining $\tilde{q}_{\pi^{(n)}}$ over $\{\mathbf{y}_{[k]}^{(n)}\}_{k=1}^{K^{(n)}}$ as […]

Figures (6)

  • Figure 1: $\mathtt{KLMS}$ Outline. Note that the final sample $y^*$ is a sample from $\tilde{q}_{\pi^{(t, n)}}(\mathbf{y}) = \sum_{k=1}^K \pi^{(t,n)}(k) \cdot \mathbf{1}(\mathbf{y}_{[k]}^{(t, n)} = \mathbf{y})$.
  • Figure 2: Average KL divergence between the client-only and global distributions, for different layers and rounds (FedPM used to train CONV6 on CIFAR-10).
  • Figure 3: (top) SGLD-KLMS against QLSD using LeNet on the i.i.d. MNIST dataset. (bottom) FedPM-KLMS (fixed) against FedPM-KLMS (adaptive), showing how closely the number of bits approaches the fundamental quantity, the KL divergence, using CONV6 on i.i.d. CIFAR-10. Both the KL divergence and the number of bits are normalized by the number of parameters.
  • Figure 4: Estimation gap statistics for different values of $r$, as a function of the number of participating clients $N$. (left) The empirical standard deviation of the estimation gap, computed over $100$ runs. (right) Estimation gap between $\mu$ and $\hat{\mu}$ averaged over $100$ runs.
  • Figure 5: Estimation gap statistics for different values of $\eta$, as a function of the number of participating clients $N$. (left) The empirical standard deviation of the estimation gap, computed over $100$ runs. (right) Estimation gap between $\mu$ and $\hat{\mu}$ averaged over $100$ runs.
  • ...and 1 more figure
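
Figure 3 contrasts fixed and adaptive block allocation. The exact allocation rule is not given in this summary, so the following is only a hypothetical sketch of one way to balance bitrate across coordinates: greedily group consecutive coordinates until each block accumulates roughly a target KL divergence. The function name `allocate_blocks` and the parameter `target_kl` are assumptions made for illustration.

```python
import numpy as np

def allocate_blocks(kl_per_coord, target_kl):
    """Greedily split coordinates into contiguous blocks of roughly equal KL.

    Returns a list of (start, end) index pairs, with `end` exclusive.
    """
    blocks, start, acc = [], 0, 0.0
    for i, kl in enumerate(kl_per_coord):
        acc += float(kl)
        if acc >= target_kl:
            # Close the block once its accumulated KL reaches the target.
            blocks.append((start, i + 1))
            start, acc = i + 1, 0.0
    if start < len(kl_per_coord):
        # Remaining coordinates form a final, lower-KL block.
        blocks.append((start, len(kl_per_coord)))
    return blocks

# Usage: high-KL coordinates get short blocks, low-KL coordinates get long ones.
kl = np.array([0.5, 0.5, 0.25, 0.25, 0.25, 0.75, 1.5, 0.125])
print(allocate_blocks(kl, target_kl=1.0))
```

With a fixed block size, blocks covering high-KL coordinates would need many more candidates than blocks covering low-KL ones. Equalizing the KL per block keeps the per-block index size roughly constant.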

Theorems & Definitions (5)

  • Theorem 4.1
  • Theorem C.1
  • proof
  • Theorem C.2 (restatement of Theorem 4.1)
  • proof