FedCod: An Efficient Communication Protocol for Cross-Silo Federated Learning with Coding

Peishen Yan, Jun Li, Hao Wang, Tao Song, Yang Hua, Lu Peng, Haihui Zhou, Haibing Guan

TL;DR

FedCod tackles WAN heterogeneity and bottlenecks in cross-silo Federated Learning (FL) with an application-layer coding protocol that enables client-to-client forwarding and adaptive redundancy. In the download phase, the server encodes partitions of the global model; in the upload phase, each client encodes its local update, and clients aggregate the encoded blocks they receive before forwarding them to the server (Coded-AGR). The protocol remains agnostic to the underlying FL algorithm. Key contributions include (i) tailored coding strategies for download and upload, (ii) a novel Coded Aggregation mechanism, and (iii) an adaptive redundancy algorithm that balances reliability and traffic. Experiments on global and North America WAN topologies show a reduction of up to 62% in average communication time while maintaining FL training performance, supporting FedCod’s practicality for cross-silo FL deployments over heterogeneous networks.

Abstract

Federated Learning (FL) is a distributed machine learning paradigm that enables multiple parties to collaboratively train a model without sharing their raw data, thereby preserving data privacy. Communication efficiency is a concern in cross-silo FL, particularly because of the network heterogeneity and fluctuations associated with geo-distributed silos. Most existing solutions focus on algorithmic improvements that alter the FL algorithm but sacrifice training performance. How to address these problems from a network perspective, decoupled from the FL algorithm, remains an open challenge. In this paper, we propose FedCod, a new application-layer communication protocol designed for cross-silo FL. FedCod transparently uses a coding mechanism to exploit idle bandwidth through client-to-client communication, and dynamically adjusts coding redundancy to mitigate network bottlenecks and fluctuations, thereby improving communication efficiency and accelerating training. In our real-world experiments, FedCod reduces average communication time by up to 62% compared to the baseline, while maintaining FL training performance and optimizing inter-client communication traffic.

Paper Structure (23 sections, 1 theorem, 3 equations, 9 figures, 3 tables)

This paper contains 23 sections, 1 theorem, 3 equations, 9 figures, and 3 tables.

Key Result

Proposition 1

The wait mode has theoretically better communication performance than the non-wait mode.

Figures (9)

  • Figure 1: Detailed location information and communication bandwidth profiling results: (a) Global topology; (b) North America topology; (c) Profiling results for the global topology; (d) Profiling results for the North America topology.
  • Figure 2: The communication overhead for baseline and our two adapted network coding communication protocols (one in the download phase and the other in the upload phase).
  • Figure 3: The download phase of FedCod ($k=2$). The server encodes the global model partitions with random coefficient vectors. Step ❶: The server sends different encoded data blocks to different clients; Step ❷: The clients send the data blocks in the buffer to all neighbors; Step ❸: The client decodes a global model with enough encoded data blocks.
  • Figure 4: The upload phase of FedCod with Coded-AGR algorithm ($k=2$). All clients encode the local model updates with the same sequence of coefficient vectors. Step ❶: The clients send the encoded data blocks to the specific neighbors with the predefined mapping; Step ❷: Each client maintains a Coded-AGR buffer and aggregates the encoded data blocks with the same coefficient vector into an AGR data block; Step ❸: The clients send the AGR data blocks to the server. Finally, the server decodes an aggregated global model from the AGR data blocks.
  • Figure 5: The communication time of different communication protocols. Baseline: the basic server-client communication protocol; HierFL: the hierarchical FL communication protocol; D1-NC: apply network coding in the download phase; D2-C: apply our coding strategy in the download phase; U1-C: apply our coding strategy in the upload phase; U2-AGR: apply non-wait mode Coded-AGR in the upload phase; U3-AGR: apply wait mode Coded-AGR in the upload phase; FedCod: FedCod with static redundancy; Adaptive: FedCod with adaptive redundancy.
  • ...and 4 more figures
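
The captions of Figures 3 and 4 can be illustrated with a small sketch of linear coding for $k=2$. It is not the authors' implementation: the helper names (`split`, `encode`, `decode`), the fixed coefficient vectors, the exact rational arithmetic, and the use of an element-wise sum as the aggregation rule are all assumptions made for illustration (the captions say the server draws random coefficient vectors and do not spell out the aggregation rule). The sketch shows two properties the captions rely on: any $k$ blocks with linearly independent coefficient vectors suffice to decode, and blocks encoded with the same coefficient vector can be aggregated before decoding because encoding is linear.

```python
from fractions import Fraction

def split(model, k):
    """Split a flat parameter list into k equal partitions (len % k == 0 assumed)."""
    size = len(model) // k
    return [model[i * size:(i + 1) * size] for i in range(k)]

def encode(partitions, coeffs):
    """One encoded block: the linear combination sum_i coeffs[i] * partitions[i]."""
    length = len(partitions[0])
    return [sum(c * p[j] for c, p in zip(coeffs, partitions)) for j in range(length)]

def decode(coeff_rows, blocks):
    """Recover the k partitions from k encoded blocks via Gauss-Jordan elimination."""
    k = len(coeff_rows)
    # Augmented matrix [coefficients | block payload], in exact rationals.
    rows = [[Fraction(c) for c in cr] + [Fraction(x) for x in b]
            for cr, b in zip(coeff_rows, blocks)]
    for col in range(k):
        pivot = next((r for r in range(col, k) if rows[r][col] != 0), None)
        if pivot is None:
            raise ValueError("coefficient vectors are linearly dependent")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        scale = rows[col][col]
        rows[col] = [x / scale for x in rows[col]]
        for r in range(k):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[k:] for row in rows]

def flatten(partitions):
    return [x for part in partitions for x in part]

K = 2

# Download phase (Figure 3): the server emits k + 1 encoded blocks
# (one redundant block). Fixed vectors stand in for random ones here.
global_model = [1, 2, 3, 4, 5, 6]
download_coeffs = [[1, 2], [3, 5], [2, 7]]
server_blocks = [encode(split(global_model, K), c) for c in download_coeffs]

# A client holding any K blocks (say block 0 from the server and block 2
# forwarded by a neighbor) can decode the full model.
recovered = flatten(decode([download_coeffs[0], download_coeffs[2]],
                           [server_blocks[0], server_blocks[2]]))

# Upload phase (Figure 4): every client uses the SAME coefficient vectors,
# so blocks that share a vector can be summed before they reach the server.
client_updates = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
upload_coeffs = [[1, 2], [3, 5]]
agr_blocks = []
for coeffs in upload_coeffs:
    encoded = [encode(split(update, K), coeffs) for update in client_updates]
    agr_blocks.append([sum(values) for values in zip(*encoded)])

# The server decodes the aggregate directly; no single update is decoded.
aggregated = flatten(decode(upload_coeffs, agr_blocks))
```

In this sketch `recovered` equals the original global model and `aggregated` equals the element-wise sum of the three client updates. A practical implementation would more likely operate on floating-point tensors or over a finite field; exact fractions are used here only to keep the round trip lossless and easy to check.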

Theorems & Definitions (2)

  • Proposition 1
  • proof