Taming Subnet-Drift in D2D-Enabled Fog Learning: A Hierarchical Gradient Tracking Approach

Evan Chen, Shiqiang Wang, Christopher G. Brinton

TL;DR

This work tackles subnet-drift in SD-FL by introducing Semi-Decentralized Gradient Tracking (SD-GT), which employs dual gradient-tracking terms to stabilize updates across the D2D and DS communication layers over two timescales. It provides Lyapunov-based convergence bounds for both non-convex and strongly convex objectives and proposes a co-optimization framework to trade learning speed against communication cost via subnet sampling and D2D rounds. Theoretical results are corroborated by extensive experiments on real-world and synthetic datasets, showing substantial improvements in model quality and communication efficiency over SD-FL and gradient-tracking baselines. The approach enables robust, scalable fog-learning deployments with tunable efficiency suitable for heterogeneous networks.

Abstract

Federated learning (FL) encounters scalability challenges when implemented over fog networks. Semi-decentralized FL (SD-FL) proposes a solution that divides model cooperation into two stages: at the lower stage, device-to-device (D2D) communications are employed for local model aggregations within subnetworks (subnets), while the upper stage handles device-server (DS) communications for global model aggregations. However, existing SD-FL schemes are based on gradient diversity assumptions that become performance bottlenecks as data distributions become more heterogeneous. In this work, we develop semi-decentralized gradient tracking (SD-GT), the first SD-FL methodology that removes the need for such assumptions by incorporating tracking terms into device updates for each communication layer. Analytical characterization of SD-GT reveals convergence upper bounds for both non-convex and strongly-convex problems, for a suitable choice of step size. We employ the resulting bounds in the development of a co-optimization algorithm for optimizing subnet sampling rates and D2D rounds according to a performance-efficiency trade-off. Our subsequent numerical evaluations demonstrate that SD-GT obtains substantial improvements in trained model quality and communication cost relative to baselines in SD-FL and gradient tracking on several datasets.

Paper Structure

This paper contains 18 sections, 11 theorems, 41 equations, 5 figures, 1 algorithm.

Key Result

Theorem 1

(Non-convex) Under Assumptions asmp1, asmp2, and assmp3, let $\beta_{s} = \frac{m_{s} - h_{s}}{m_{s}}$ be the ratio of unsampled clients from each subnet, and define the sample-wise mixing rate term $\phi_{s} = (1 - \beta_{s}^2)$. Define $p = \min(\phi_{1} , \ldots, \phi_{S})\in (0,1]$, and $q = \m
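
As a quick numerical illustration of the quantities that are fully defined in the Theorem 1 excerpt above, the following sketch computes $\beta_s$, $\phi_s$, and $p$ from per-subnet sizes $m_s$ and sampled counts $h_s$. The function name `mixing_terms` and the example values are hypothetical, and the requirement $1 \le h_s \le m_s$ is an assumption made here so that $p$ stays in $(0,1]$; the quantity $q$ is omitted because its definition is cut off in the excerpt.

```python
def mixing_terms(subnet_sizes, sampled_counts):
    """Return (betas, phis, p) as defined in the Theorem 1 excerpt.

    subnet_sizes   -- m_s, number of clients in each subnet
    sampled_counts -- h_s, number of clients sampled from each subnet
    """
    if len(subnet_sizes) != len(sampled_counts):
        raise ValueError("need one sampled count per subnet")
    for m, h in zip(subnet_sizes, sampled_counts):
        # Assumption: at least one client is sampled per subnet, so phi_s > 0.
        if not 1 <= h <= m:
            raise ValueError("each h_s must satisfy 1 <= h_s <= m_s")
    # beta_s: fraction of clients in subnet s that are NOT sampled
    betas = [(m - h) / m for m, h in zip(subnet_sizes, sampled_counts)]
    # phi_s = 1 - beta_s^2: sample-wise mixing rate term
    phis = [1 - b ** 2 for b in betas]
    # p: the smallest mixing rate term over all subnets
    p = min(phis)
    return betas, phis, p


# Hypothetical example: two subnets with 10 and 5 clients,
# sampling 4 clients from the first and all 5 from the second.
betas, phis, p = mixing_terms([10, 5], [4, 5])
```

With these example values, the subnet with the lower sampling rate determines $p$; sampling every client in a subnet gives $\beta_s = 0$ and $\phi_s = 1$.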

Figures (5)

  • Figure 1: Illustration of semi-decentralized FL. Clients in each subnet communicate via iterative low-cost D2D communications to conduct local aggregations. Once they have converged towards a consensus within the subnet, the central server conducts a global aggregation across sampled devices using DS communication.
  • Figure 2: An illustration of how SD-GT deals with subnet-drift. With the introduction of the in-subnet GT term $z_i^t$, all clients within each subnet are able to converge towards a consensus point of the subnet. The inter-subnet GT term $y_i^t$ then corrects the update direction of the whole subnet so that it no longer converges towards the optimal solution $x_{\mathcal{C}_s}^*$ of the subnet $\mathcal{C}_s$ but towards the optimal solution $x^*$ of the whole network.
  • Figure 3: Experimental results on real-world datasets ($\frac{h_{s}}{m_{s}} = 40\%$). By fixing the sampling rate for each subnet to $40$ percent and observing the effect of performing multiple D2D communication rounds, we see that our algorithm SD-GT gains the most improvement from increasing the D2D rounds. The advantage of our method can still be observed even if we double the number of clients in the system.
  • Figure 4: Experimental results on a synthetic dataset $(K = 40)$. By fixing the number of D2D rounds performed between two global aggregations, we see that our algorithm SD-GT is able to improve from increasing the sampling rate of each subnet while maintaining a linear rate.
  • Figure 5: Experiments on the proposed co-optimization algorithm. When $\delta$ is small (D2D communication is cheap), the co-optimization algorithm is able to choose the sample rate and the number of D2D communication rounds such that more improvement is obtained using the same amount of communication.
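
The two-stage aggregation pattern described in the Figure 1 caption can be sketched as follows. This is a minimal illustration of plain semi-decentralized aggregation only, not the SD-GT update rule, whose tracking terms are not specified in this summary. The function names, the ring-shaped doubly stochastic mixing matrix `W`, and the size-weighted averaging at the server are all assumptions made for the example.

```python
import random


def d2d_round(models, W):
    """One D2D round: client i replaces its model with sum_j W[i][j] * models[j]."""
    n, dim = len(models), len(models[0])
    return [[sum(W[i][j] * models[j][k] for j in range(n)) for k in range(dim)]
            for i in range(n)]


def global_aggregate(subnets, sampled_counts, rng):
    """Server step: sample h_s clients per subnet and average their models.

    Assumption: each subnet's sample mean is weighted by its share of clients.
    """
    total = sum(len(models) for models in subnets)
    dim = len(subnets[0][0])
    agg = [0.0] * dim
    for models, h in zip(subnets, sampled_counts):
        picked = rng.sample(models, h)  # sample h clients without replacement
        for k in range(dim):
            sample_mean = sum(m[k] for m in picked) / h
            agg[k] += (len(models) / total) * sample_mean
    return agg


# Hypothetical mixing matrix for a ring of 4 clients (rows and columns sum to 1).
W = [[0.50, 0.25, 0.00, 0.25],
     [0.25, 0.50, 0.25, 0.00],
     [0.00, 0.25, 0.50, 0.25],
     [0.25, 0.00, 0.25, 0.50]]

# Two subnets with 4 clients each and 2-dimensional toy models.
subnets = [
    [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]],
    [[10.0, 4.0], [10.0, 4.0], [12.0, 4.0], [12.0, 4.0]],
]

# Lower stage: repeated D2D rounds drive each subnet towards its own average.
for _ in range(50):
    subnets = [d2d_round(models, W) for models in subnets]

# Upper stage: the server aggregates 2 sampled clients from each subnet.
rng = random.Random(0)
global_model = global_aggregate(subnets, [2, 2], rng)
```

Because `W` is doubly stochastic, each D2D round preserves the subnet average, so once the subnet is close to consensus any sampled subset is close to that average. With fewer D2D rounds the sampled clients would disagree more, and the global aggregate would depend on which clients were sampled.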

Theorems & Definitions (22)

  • Theorem 1
  • proof
  • Corollary 1
  • proof
  • Theorem 2
  • proof
  • Corollary 2
  • proof
  • Lemma 1
  • proof
  • ...and 12 more