Robust and Communication-Efficient Federated Learning from Non-IID Data

Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, Wojciech Samek

TL;DR

This work tackles the high communication cost of Federated Learning under non-IID data by introducing Sparse Ternary Compression (STC), a framework that compresses both upstream and downstream updates using sparsification, ternarization to $\{-\mu, 0, \mu\}$, residual accumulation, and Golomb encoding. It further extends STC with server-side downstream compression, a weight-update caching mechanism for partial participation, and redundancy elimination via binarization, achieving strong bidirectional compression with minimal accuracy loss. Empirical results across four tasks show STC consistently outperforms Federated Averaging and signSGD in non-IID, small-batch, and low-participation regimes, while delivering substantial communication savings even in IID settings. The method enables robust, bandwidth-efficient Federated Learning suitable for large-scale IoT deployments, where high-frequency, low-volume communication is preferable to infrequent, high-volume transfers. The results hinge on carefully designed encoding and residual mechanisms that keep updates accurate despite aggressive sparsification.
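
The sparsification, ternarization, and residual accumulation steps named above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the weight update is a flat list of floats, that the sparsity rate is the fraction of entries kept, and that $\mu$ is the mean magnitude of the kept entries. The function names `sparse_ternary_compress` and `compress_with_residual` are hypothetical, and the Golomb encoding of the nonzero positions is omitted.

```python
from typing import List, Tuple


def sparse_ternary_compress(update: List[float], sparsity: float) -> List[float]:
    """Keep the k largest-magnitude entries and map each to -mu, 0, or +mu.

    Assumption: mu is the mean absolute value of the k kept entries.
    """
    n = len(update)
    k = max(1, int(n * sparsity))  # always keep at least one entry
    # Indices of the k entries with the largest absolute value.
    top = sorted(range(n), key=lambda i: abs(update[i]), reverse=True)[:k]
    mu = sum(abs(update[i]) for i in top) / k
    out = [0.0] * n
    for i in top:
        if update[i] > 0:
            out[i] = mu
        elif update[i] < 0:
            out[i] = -mu
    return out


def compress_with_residual(
    update: List[float], residual: List[float], sparsity: float
) -> Tuple[List[float], List[float]]:
    """Add the carried-over residual, compress, and return the new residual."""
    accumulated = [r + u for r, u in zip(residual, update)]
    compressed = sparse_ternary_compress(accumulated, sparsity)
    # Whatever was not transmitted is kept locally for the next round.
    new_residual = [a - c for a, c in zip(accumulated, compressed)]
    return compressed, new_residual


# Illustrative values only, not taken from the paper.
update = [0.5, -0.1, 0.02, -0.7, 0.05, 0.0, 0.3, -0.04]
residual = [0.0] * len(update)
compressed, residual = compress_with_residual(update, residual, sparsity=0.25)
```

In this sketch only two of the eight entries survive, both with the same magnitude, so a transmitted message would need just the positions, the signs, and a single value of $\mu$. The residual stores the untransmitted remainder, so small entries can accumulate across rounds until they are large enough to be sent.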

Abstract

Federated Learning allows multiple parties to jointly train a deep learning model on their combined data, without any of the participants having to reveal their local data to a centralized server. This form of privacy-preserving collaborative learning, however, comes at the cost of a significant communication overhead during training. To address this problem, several compression methods have been proposed in the distributed training literature that can reduce the amount of required communication by up to three orders of magnitude. These existing methods, however, are only of limited utility in the Federated Learning setting, as they either only compress the upstream communication from the clients to the server (leaving the downstream communication uncompressed) or only perform well under idealized conditions such as iid distribution of the client data, which typically cannot be found in Federated Learning. In this work, we propose Sparse Ternary Compression (STC), a new compression framework that is specifically designed to meet the requirements of the Federated Learning environment. Our experiments on four different learning tasks demonstrate that STC distinctly outperforms Federated Averaging in common Federated Learning scenarios where a) clients hold non-iid data, b) clients use small batch sizes during training, or c) the number of clients is large and the participation rate in every communication round is low. We furthermore show that even if the clients hold iid data and use medium-sized batches for training, STC still behaves Pareto-superior to Federated Averaging in the sense that it achieves fixed target accuracies on our benchmarks within both fewer training iterations and a smaller communication budget.

Paper Structure

This paper contains 21 sections, 16 equations, 16 figures, 4 tables, 5 algorithms.

Figures (16)

  • Figure 1: Federated Learning with a parameter server. Illustrated is one communication round of distributed SGD: a) Clients synchronize with the server. b) Clients compute a weight update independently based on their local data. c) Clients upload their local weight updates to the server, where they are averaged to produce the new master model.
  • Figure 2: Convergence speed when using different compression methods during the training of VGG11* on CIFAR-10 and Logistic Regression on MNIST and Fashion-MNIST in a distributed setting with 10 clients for iid and non-iid data. In the non-iid cases, every client only holds examples from exactly two or exactly one of the 10 classes in the dataset, respectively. All compression methods suffer from degraded convergence speed in the non-iid situation, but sparse top-k is affected by far the least.
  • Figure 3: Left: Distribution of values for $\alpha_w(1)$ for the weight layer of a logistic regression over the MNIST dataset. Right: Development of $\alpha(k)$ for increasing batch sizes. In the iid case the batches are sampled randomly from the training data, while in the non-iid case every batch contains samples from only exactly one class. For iid batches the gradient sign becomes increasingly accurate with growing batch sizes. For non-iid batches of data this is not the case. The gradient signs remain highly incongruent with the full-batch gradient, no matter how large the size of the batch.
  • Figure 4: Accuracy achieved by VGG11* when trained on CIFAR in a distributed setting with 5 clients for 16000 iterations at different levels of upload and download sparsity. Sparsifying the updates for downstream communication reduces the final accuracy by at most 3% when compared to using only upload sparsity.
  • Figure 5: The effects of binarization at different levels of upload and download sparsity. Displayed is the difference in final accuracy in % between a model trained with sparse updates and a model trained with sparse binarized updates. Positive numbers indicate better performance of the model trained with pure sparsity. VGG11 trained on CIFAR-10 for 16000 iterations with 5 clients holding iid and non-iid data.
  • ...and 11 more figures