Route-and-Aggregate Decentralized Federated Learning Under Communication Errors

Weicai Li, Tiejun Lv, Wei Ni, Jingbo Zhao, Ekram Hossain, H. Vincent Poor

TL;DR

This work addresses the inefficiency of gossip-based decentralized federated learning under unreliable communications by introducing Route-and-Aggregate D-FL (R&A D-FL), which routes model updates along established paths and adaptively normalizes aggregation to account for partial deliveries. Theoretical analysis yields a one-round convergence upper bound that degrades with end-to-end packet error rates, and shows the optimum routing corresponds to minimizing E2E-PERs, enabling a standard shortest-path formulation. Empirical results across image classification and language tasks demonstrate that R&A D-FL substantially improves training accuracy over flooding-based D-FL (by ~35% in a 10-client network) and asymptotically matches C-FL as routing nodes increase. The approach highlights a strong synergy between D-FL and networking and suggests practical routing strategies to bolster distributed learning in imperfect networks.
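The reduction to a shortest-path problem mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes link errors are independent, so the end-to-end success probability of a route is the product of its per-link success probabilities, and minimizing the E2E-PER is equivalent to minimizing the sum of $-\log(1 - \text{PER})$ over the links of the route. The function name `min_per_route`, the `links` dictionary layout, and the example topology are hypothetical.

```python
import heapq
import math


def min_per_route(links, src, dst):
    """Return (route, e2e_per) for the route from src to dst with the lowest E2E-PER.

    links maps an undirected link (u, v) to its packet error rate in [0, 1).
    Assumption (not from the paper): link errors are independent.
    """
    # Build an adjacency list with additive weights w = -log(1 - PER).
    adj = {}
    for (u, v), per in links.items():
        w = -math.log(1.0 - per)
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))

    # Standard Dijkstra over the non-negative transformed weights.
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))

    if dst not in dist:
        return None, 1.0  # unreachable: every packet is lost

    # Walk the predecessor map back from dst to recover the route.
    route = [dst]
    while route[-1] != src:
        route.append(prev[route[-1]])
    route.reverse()
    return route, 1.0 - math.exp(-dist[dst])


# Hypothetical topology: clients A and B, plus one routing node R.
links = {("A", "B"): 0.3, ("A", "R"): 0.1, ("R", "B"): 0.1}
route, e2e_per = min_per_route(links, "A", "B")
print(route, round(e2e_per, 4))
```

In this toy topology the two-hop route through the routing node R has a lower E2E-PER (1 - 0.9 × 0.9 = 0.19) than the direct link (0.3), which illustrates how non-participating routing nodes can help model delivery.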

Abstract

Decentralized federated learning (D-FL) allows clients to aggregate learning models locally, offering flexibility and scalability. Existing D-FL methods use gossip protocols, which are inefficient when not all nodes in the network are D-FL clients. This paper puts forth a new D-FL strategy, termed Route-and-Aggregate (R&A) D-FL, where participating clients exchange models with their peers through established routes (as opposed to flooding) and adaptively normalize their aggregation coefficients to compensate for communication errors. The impact of routing and imperfect links on the convergence of R&A D-FL is analyzed, revealing that the convergence upper bound is minimized when routes with the minimum end-to-end packet error rates are employed to deliver models. Our analysis is experimentally validated through three image classification tasks and two next-word prediction tasks, utilizing widely recognized datasets and models. R&A D-FL outperforms the flooding-based D-FL method in terms of training accuracy by 35% in our tested 10-client network, and shows strong synergy between D-FL and networking. In another test with 10 D-FL clients, the training accuracy of R&A D-FL with communication errors approaches that of the ideal C-FL without communication errors, as the number of routing nodes (i.e., nodes that do not participate in the training of D-FL) rises to 28.
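The idea of adaptively normalizing aggregation coefficients can be illustrated with a small sketch. The abstract does not give the exact rule, so the version below is an assumption: each client starts from nominal weights (for example, proportional to local data sizes) and rescales them over the models that actually arrived, so that the coefficients used in a round still sum to one. The function name `aggregate_received` and the whole-model `received` mask are hypothetical. The paper's scheme may operate at a finer granularity, such as per packet.

```python
def aggregate_received(models, weights, received):
    """Weighted average over the models that were delivered without error.

    models:   list of equal-length lists of floats (flattened model parameters)
    weights:  nominal aggregation weights, one per model
    received: list of booleans, True if that model arrived intact
    Assumption (not from the paper): weights are renormalized over the
    received set so that the effective coefficients sum to 1.
    """
    total = sum(w for w, ok in zip(weights, received) if ok)
    if total == 0:
        raise ValueError("no model was received; nothing to aggregate")

    aggregated = [0.0] * len(models[0])
    for model, w, ok in zip(models, weights, received):
        if not ok:
            continue  # dropped in transit, so it contributes nothing
        coef = w / total  # renormalized coefficient
        for i, x in enumerate(model):
            aggregated[i] += coef * x
    return aggregated


# Three clients; the second client's model is lost in this round.
models = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [0.5, 0.25, 0.25]
received = [True, False, True]
result = aggregate_received(models, weights, received)
print(result)
```

Without the renormalization, the lost model would shrink the aggregate toward zero, since the surviving coefficients would sum to 0.75 rather than 1. A client's own model is always available locally, so in practice the received set is never empty.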

Paper Structure

This paper contains 26 sections, 8 theorems, 45 equations, 10 figures, 3 tables.

Key Result

Lemma 1

Under the paper's assumption, with $L$ and $\mu$ defined therein, the expectation of the distance between the global model of D-FL in the $t$-th training round, i.e., $\boldsymbol{\bar{\omega}}_I^{t}$, and the global optimum of D-FL, i.e., $\mathbf{w}^*$, is bounded as ..., where the coefficients on the right-hand side (RHS) of the bound are ..., and $\tau_{\rho}$ indicates the noise level of the ...

Figures (10)

  • Figure 1: An illustration of the selected routes for local model delivery from participating clients to the designated client in the $t$-th round.
  • Figure 2: Training accuracy and loss of the CNN model on the Fed-FashionMNIST dataset versus the training round.
  • Figure 3: Training accuracy of D-FL versus the edge density and the packet length, where the CNN model and non-i.i.d. Fed-FashionMNIST dataset are considered. C-FL selects the best-performing client to serve as the central aggregator.
  • Figure 4: Training accuracy vs. training rounds, where the ResNet18 model and non-i.i.d. Fed-CIFAR100 dataset are considered. C-FL selects the best-performing client to serve as the central aggregator.
  • Figure 5: Training accuracy vs. training rounds, where the ResNet56 model and CIFAR10 dataset are considered. C-FL selects the best-performing client to serve as the central aggregator.
  • ...and 5 more figures

Theorems & Definitions (13)

  • Lemma 1
  • proof
  • Lemma 2
  • Lemma 3
  • proof
  • Theorem 1
  • Theorem 2
  • Proposition 1
  • proof
  • Lemma 4
  • ...and 3 more