Communication-Efficient Federated Optimization over Semi-Decentralized Networks

He Wang, Yuejie Chi

TL;DR

This work tackles the communication bottleneck in large-scale federated and decentralized learning by proposing PISCO, a gradient-tracking–based algorithm designed for semi-decentralized networks where server access occurs with probability $p$. PISCO integrates multiple local updates with a probabilistic mix of agent-to-server and agent-to-agent communications, enabling a linear speedup in the number of agents $n$ and local updates $T_o$. The authors prove convergence to a stationary point with rate $O\left(\dfrac{1}{\sqrt{nT_oK}}\right)$ for mini-batch gradients and $O\left(\dfrac{1}{nK}\right)$ for full-batch gradients, while reducing network dependency to $O\left(\lambda_p^{-2}\right)$ under appropriate $p$ and connectivity, without requiring bounded data dissimilarity. Empirical results on logistic regression with nonconvex regularization and neural networks confirm improved communication efficiency, robustness to heterogeneity, and resilience across network topologies, highlighting PISCO’s practical impact for scalable distributed learning.

Abstract

In large-scale federated and decentralized learning, communication efficiency is one of the most challenging bottlenecks. While gossip communication -- where agents can exchange information with their connected neighbors -- is more cost-effective than communicating with the remote server, it often requires a greater number of communication rounds, especially for large and sparse networks. To tackle the trade-off, we examine the communication efficiency under a semi-decentralized communication protocol, in which agents can perform both agent-to-agent and agent-to-server communication in a probabilistic manner. We design a tailored communication-efficient algorithm over semi-decentralized networks, referred to as PISCO, which inherits the robustness to data heterogeneity thanks to gradient tracking and allows multiple local updates for saving communication. We establish the convergence rate of PISCO for nonconvex problems and show that PISCO enjoys a linear speedup in terms of the number of agents and local updates. Our numerical results highlight the superior communication efficiency of PISCO and its resilience to data heterogeneity and various network topologies.
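The semi-decentralized protocol described above can be illustrated with a small simulation. The following is a minimal sketch, not PISCO itself: it runs a generic gradient-tracking iteration with a single local step per round, and in each round mixes through the server (exact averaging) with probability $p$ or through one gossip step on a ring otherwise. The quadratic local objectives, the ring weights of 1/3, the step size, and all function names are assumptions chosen for illustration; multiple local updates and stochastic gradients are omitted.

```python
import numpy as np


def ring_mixing_matrix(n):
    """Symmetric doubly stochastic gossip matrix for a ring (assumed weights of 1/3)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W


def run_semi_decentralized_gt(p, n=8, d=3, eta=0.1, rounds=300, seed=0):
    """Gradient tracking with probabilistic server/gossip mixing (illustrative only).

    Agent i holds the hypothetical objective f_i(x) = 0.5 * ||x - b_i||^2, so the
    minimizer of the average objective is the mean of the b_i.
    """
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(n, d))      # heterogeneous local targets b_i (row i)

    def grad(X):
        return X - B                 # row i is the gradient of f_i at x_i

    W = ring_mixing_matrix(n)        # agent-to-agent communication
    J = np.full((n, n), 1.0 / n)     # agent-to-server communication (exact average)

    X = np.zeros((n, d))             # local models
    Y = grad(X)                      # gradient-tracking variables
    server_rounds = 0
    for _ in range(rounds):
        use_server = rng.random() < p
        M = J if use_server else W
        server_rounds += int(use_server)
        X_new = M @ (X - eta * Y)                # local step, then mixing
        Y = M @ (Y + grad(X_new) - grad(X))      # track the average gradient
        X = X_new

    x_star = B.mean(axis=0)
    error = np.linalg.norm(X - x_star, axis=1).max()
    tracking_gap = np.linalg.norm(Y.mean(axis=0) - grad(X).mean(axis=0))
    return error, tracking_gap, server_rounds


if __name__ == "__main__":
    for p in (0.0, 0.1, 1.0):
        err, gap, used = run_semi_decentralized_gt(p)
        print(f"p={p}: max error={err:.2e}, tracking gap={gap:.2e}, server rounds={used}")
```

Because both mixing matrices are doubly stochastic, the average of the tracking variables stays equal to the average of the local gradients in every round, which is what lets this kind of scheme cope with heterogeneous local data in the toy setting.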

Paper Structure

This paper contains 29 sections, 15 theorems, 104 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Suppose Assumptions assump:graph, assump:smooth, and assump:bounded_variance hold. Let $\tilde{f} = f(\overline{\boldsymbol{x}}^0) - f^\star$ and $\boldsymbol{\Phi}_y^0 = \boldsymbol{Y}^0 - \boldsymbol{Y}^0\boldsymbol{J}$. For any $\alpha \ge 0.1$ such that $\eta_c = \alpha\sqrt{1+p}\,\lambda_p$ and $\eta_l$ … where the average model estimate $\overline{\boldsymbol{x}}^k = \frac{1}{n}\sum_{i=1}^n$ …

Figures (7)

  • Figure 1: Two communication models for distributed ML.
  • Figure 2: The semi-decentralized communication protocol, where the server can be accessed with probability $p$ and agents communicate with their neighbors whenever the server is not available. Here, dotted lines represent agent-to-server communication, while solid lines represent agent-to-agent communication.
  • Figure 3: The network dependency of PISCO regarding agent-to-server communication probability $p$.
  • Figure 4: The number of agent-to-agent and agent-to-server communication rounds required to achieve $0.05$ training accuracy (the left panel) and $80\%$ test accuracy (the right panel) for PISCO with $T_o=1$ and different $p \in\{0,10^{-2},10^{-1.75},10^{-1.5}, 10^{-1.25},10^{-1},10^{-0.75},10^{-0.5},1\}$. Here, the blue (red) dotted line represents the number of agent-to-agent (agent-to-server) communication rounds that PISCO with $p=0$ (with $p=1$) requires.
  • Figure 5: The training accuracy (left two panels) and testing accuracy (right two panels) against communication rounds for different probabilities $p=1,10^{-0.5},10^{-1},0$ and different numbers of local updates $T_o=1,10$, over a ring topology for logistic regression with a nonconvex regularizer on the sorted a9a dataset.
  • ...and 2 more figures

Theorems & Definitions (29)

  • Definition 1: Mixing matrix and mixing rate
  • Theorem 1: Convergence rate
  • Corollary 1: Convergence rate with mini batch
  • Remark 1: Decentralized case
  • Remark 2: Federated case
  • Remark 3: For well-connected networks
  • Remark 4: For poorly-connected networks
  • Corollary 2: Communication complexity with large batch
  • Proposition 1
  • Proposition 2
  • ...and 19 more