A Hierarchical Gradient Tracking Algorithm for Mitigating Subnet-Drift in Fog Learning Networks

Evan Chen, Shiqiang Wang, Christopher G. Brinton

TL;DR

This work addresses subnet-drift in semi-decentralized federated learning over fog networks. It introduces SD-GT, a gradient-tracking framework with separate within-subnet and between-subnet tracking terms to stabilize learning across a two-time-scale SD-FL architecture. The authors provide Lyapunov-based convergence bounds for non-convex, convex, and strongly convex settings, and develop a geometric-programming-based co-optimization to trade off convergence speed against communication cost. Empirical results on MNIST, CIFAR, and synthetic tasks show substantial improvements in model quality and communication efficiency over SD-FL baselines and prior gradient-tracking methods.

Abstract

Federated learning (FL) encounters scalability challenges when implemented over fog networks that do not follow FL's conventional star topology architecture. Semi-decentralized FL (SD-FL) has been proposed as a solution for device-to-device (D2D) enabled networks that divides model cooperation into two stages: at the lower stage, D2D communications are employed for local model aggregations within subnetworks (subnets), while the upper stage handles device-server (DS) communications for global model aggregations. However, existing SD-FL schemes are based on gradient diversity assumptions that become performance bottlenecks as data distributions become more heterogeneous. In this work, we develop semi-decentralized gradient tracking (SD-GT), the first SD-FL methodology that removes the need for such assumptions by incorporating tracking terms into device updates for each communication layer. Our analytical characterization of SD-GT reveals upper bounds on convergence for non-convex, convex, and strongly convex problems. We show how the bounds enable the development of an optimization algorithm that navigates the performance-efficiency trade-off by tuning the subnet sampling rate and the number of D2D rounds for each global training interval. Our subsequent numerical evaluations demonstrate that SD-GT obtains substantial improvements in trained model quality and communication cost relative to baselines in SD-FL and gradient tracking on several datasets.

Paper Structure (47 sections, 15 theorems, 128 equations, 17 figures, 1 table, 2 algorithms)

Key Result

Theorem 1

(Non-convex) Under Assumptions 1, 2, and 3, let $\beta_{s} = \frac{m_{s} - h_{s}}{m_{s}}$ be the fraction of unsampled clients in subnet $s$. Define $p = \min(1 - \beta_{1}^2 , \ldots,1 - \beta_{S}^2)\in (0,1]$, $q = \min(\rho_{1}, \ldots, \rho_{S})\in (0,1]$, and the function value opt
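
The constants in Theorem 1 can be evaluated directly from the per-subnet quantities. The sketch below is illustrative only: the function name `theorem1_constants` and the example numbers are assumptions, and each $\rho_s$ is taken as a given input in $(0,1]$ because this excerpt does not define it. Requiring $1 \le h_s \le m_s$ keeps $\beta_s < 1$, which is what places $p$ in $(0,1]$.

```python
def theorem1_constants(subnet_sizes, sampled_counts, rhos):
    """Compute beta_s, p and q as written in Theorem 1.

    subnet_sizes   -- m_s, number of clients in each subnet (hypothetical input)
    sampled_counts -- h_s, number of clients sampled from each subnet
    rhos           -- rho_s, supplied as-is; not defined in this excerpt
    """
    if not (len(subnet_sizes) == len(sampled_counts) == len(rhos)):
        raise ValueError("all inputs need one entry per subnet")

    betas = []
    for m_s, h_s in zip(subnet_sizes, sampled_counts):
        if not 1 <= h_s <= m_s:
            raise ValueError("need 1 <= h_s <= m_s so that p lies in (0, 1]")
        betas.append((m_s - h_s) / m_s)  # fraction of unsampled clients

    for rho in rhos:
        if not 0 < rho <= 1:
            raise ValueError("each rho_s must lie in (0, 1]")

    p = min(1 - beta ** 2 for beta in betas)
    q = min(rhos)
    return betas, p, q


# Illustrative values only: 3 subnets of 10 clients with different sampling.
betas, p, q = theorem1_constants([10, 10, 10], [5, 10, 2], [0.3, 0.5, 0.9])
print(betas, p, q)
```

Sampling every client in every subnet gives $\beta_s = 0$ and hence $p = 1$, while sampling fewer clients from any one subnet lowers $p$.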

Figures (17)

  • Figure 1: Illustration of semi-decentralized FL with gradient tracking. Clients in each subnet communicate via iterative low-cost D2D communications to conduct local aggregations. Once they have converged towards a consensus within the subnet, the central server conducts a global aggregation across sampled devices using DS communication. Two separate terms related to gradient tracking are maintained, corresponding to within subnet and between subnet gradient information, respectively.
  • Figure 2: An illustration of how SD-GT mitigates subnet-drift. By introducing the within-subnet GT term $z_i^t$, all devices within each subnet are able to course-correct towards a consensus point for the subnet, mitigating the gradient difference between local updates and the average subnet direction. Further, the between-subnet GT term $y_i^t$ corrects the drift between subnets and the global gradient direction that arises due to inter-subnet heterogeneity, so that each subnet $\mathcal{C}_s$ no longer converges towards its own optimal solution $x_{\mathcal{C}_s}^{\star}$, but towards the optimal solution $x^{\star}$ of the whole network. Both gradient tracking terms are added during each local update round, steering model update directions toward the global optimum.
  • Figure 3: Comparison between algorithms on CIFAR10 datasets when changing the number of local client updates and D2D consensus rounds $K$ between global aggregations. Each experiment is conducted with 30 clients and 3 subnets. As $K$ increases, SD-GT is able to take advantage of multiple in-subnet model update and consensus iterations while correcting for client drift to achieve better convergence speed, particularly on CIFAR10.
  • Figure 4: Comparison when changing the radius range ($q = 0.1, 0.17, 0.3, 0.87$) of the geometric graph for D2D communication links, on the CIFAR10 dataset. All experiments have $K = 3$, with 30 total clients and 3 subnets. Compared with SD-FedAvg, our method that combines both D2D communications and gradient tracking obtains robust performance over different subnet connectivity levels. SCAFFOLD's performance is unaffected since it does not employ D2D communications.
  • Figure 5: Impact of varying the total number of clients for $S = 6$ subnets ($K = 3$). As the size of each subnet increases, the within-subnet data heterogeneity increases. We see that SD-GT is able to obtain larger improvements over the baselines as the number of clients grows larger (particularly for CIFAR100, an intrinsically more heterogeneous dataset due to having more labels), due to its inclusion of within-subnet gradient tracking terms.
  • ...and 12 more figures
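
The drift-correction idea described in the Figure 2 caption can be illustrated with a much simpler relative of SD-GT. The sketch below is the classical single-layer decentralized gradient tracking iteration, not the paper's two-term hierarchical update, whose exact equations are not given in this excerpt. Everything in it is an assumption made for illustration: the scalar quadratic losses $f_i(x) = \frac{1}{2}(x - b_i)^2$, the 4-node ring mixing matrix `W`, the step size, and the names `gradient_tracking` and `tracker`. Each node keeps a tracker that follows the network-average gradient, so heterogeneous local optima $b_i$ do not pull nodes away from the shared minimizer, which for these losses is the mean of the $b_i$.

```python
def gradient_tracking(b, W, step=0.1, iters=300):
    """Classical decentralized gradient tracking on f_i(x) = 0.5 * (x - b_i)^2.

    b -- per-node local optima (heterogeneous data stand-in, hypothetical)
    W -- doubly stochastic mixing matrix of the communication graph
    """
    n = len(b)
    x = [0.0] * n
    grad = [x[i] - b[i] for i in range(n)]  # local gradients at the start
    tracker = grad[:]                       # tracker starts at the local gradient

    for _ in range(iters):
        # Mix with neighbours, then step along the tracked direction.
        x_new = [
            sum(W[i][j] * x[j] for j in range(n)) - step * tracker[i]
            for i in range(n)
        ]
        grad_new = [x_new[i] - b[i] for i in range(n)]
        # Mix trackers and add the change in the local gradient, which keeps
        # the average tracker equal to the average local gradient.
        tracker = [
            sum(W[i][j] * tracker[j] for j in range(n)) + grad_new[i] - grad[i]
            for i in range(n)
        ]
        x, grad = x_new, grad_new
    return x


# Ring of 4 nodes: each node averages itself with its two neighbours.
W = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]
b = [1.0, 2.0, 3.0, 10.0]
x_final = gradient_tracking(b, W)
print(x_final, sum(b) / len(b))
```

In this toy setting all nodes agree on the global minimizer even though their local optima differ widely. SD-GT, as summarized above, applies the same principle at two levels, with one term for within-subnet drift and one for between-subnet drift.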

Theorems & Definitions (28)

  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Theorem 3
  • Corollary 3
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • ...and 18 more