A Hierarchical Gradient Tracking Algorithm for Mitigating Subnet-Drift in Fog Learning Networks
Evan Chen, Shiqiang Wang, Christopher G. Brinton
TL;DR
This work addresses subnet-drift in semi-decentralized federated learning (SD-FL) over fog networks. It introduces SD-GT, a gradient-tracking framework with separate within-subnet and between-subnet tracking terms that stabilize learning across the two-time-scale SD-FL architecture. The authors provide Lyapunov-based convergence bounds for non-convex, convex, and strongly convex settings, and develop a geometric-programming-based co-optimization that trades off convergence speed against communication cost. Empirical results on MNIST, CIFAR, and synthetic tasks show substantial improvements in model quality and communication efficiency over SD-FL baselines and prior gradient-tracking methods.
Abstract
Federated learning (FL) encounters scalability challenges when implemented over fog networks that do not follow FL's conventional star topology architecture. Semi-decentralized FL (SD-FL) has been proposed as a solution for device-to-device (D2D) enabled networks that divides model cooperation into two stages: at the lower stage, D2D communications are employed for local model aggregations within subnetworks (subnets), while the upper stage handles device-server (DS) communications for global model aggregations. However, existing SD-FL schemes are based on gradient diversity assumptions that become performance bottlenecks as data distributions become more heterogeneous. In this work, we develop semi-decentralized gradient tracking (SD-GT), the first SD-FL methodology that removes the need for such assumptions by incorporating tracking terms into device updates for each communication layer. Our analytical characterization of SD-GT reveals upper bounds on convergence for non-convex, convex, and strongly convex problems. We show how the bounds enable the development of an optimization algorithm that navigates the performance-efficiency trade-off by tuning the subnet sampling rate and the number of D2D rounds for each global training interval. Our subsequent numerical evaluations demonstrate that SD-GT obtains substantial improvements in trained model quality and communication cost relative to baselines in SD-FL and gradient tracking on several datasets.
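The abstract does not spell out the SD-GT update rule, so the following is not the authors' algorithm. It is a minimal sketch of the generic gradient-tracking idea that SD-GT builds on, restricted to a single subnet of four devices that exchange values over D2D links. Everything in it is an illustrative assumption: the ring topology, the mixing weights, the scalar quadratic losses f_i(x) = 0.5 (x - t_i)^2, the step size, and all function names. It compares plain decentralized gradient descent with a tracked variant when the local objectives are heterogeneous.

```python
def ring_mixing_matrix(n):
    """Doubly stochastic D2D weights on a ring: 1/2 self, 1/4 per neighbour (assumed)."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][i] += 0.5
        W[i][(i - 1) % n] += 0.25
        W[i][(i + 1) % n] += 0.25
    return W


def mix(W, v):
    """One round of D2D averaging: each device takes a weighted mean of its neighbours."""
    return [sum(w * vj for w, vj in zip(row, v)) for row in W]


def local_grads(x, targets):
    """Gradient of the assumed local loss f_i(x) = 0.5 * (x - targets[i])**2."""
    return [xi - ti for xi, ti in zip(x, targets)]


def run(targets, W, step=0.1, iters=300, tracking=True):
    x = [0.0] * len(targets)
    y = local_grads(x, targets)  # tracking term, initialised to the local gradient
    for _ in range(iters):
        g_old = local_grads(x, targets)
        if tracking:
            # Step along the tracked estimate of the average gradient.
            x_new = [m - step * yi for m, yi in zip(mix(W, x), y)]
            g_new = local_grads(x_new, targets)
            # Mix the trackers, then correct by the change in the local gradient.
            y = [m + gn - go for m, gn, go in zip(mix(W, y), g_new, g_old)]
        else:
            # Plain decentralized gradient descent: step along the raw local gradient.
            x_new = [m - step * g for m, g in zip(mix(W, x), g_old)]
        x = x_new
    return x


targets = [0.0, 1.0, 4.0, 10.0]           # heterogeneous local minimisers
W = ring_mixing_matrix(len(targets))
optimum = sum(targets) / len(targets)      # minimiser of the average loss

x_gt = run(targets, W, tracking=True)
x_plain = run(targets, W, tracking=False)
gap_gt = max(abs(xi - optimum) for xi in x_gt)
gap_plain = max(abs(xi - optimum) for xi in x_plain)
print("worst-case gap with tracking:   ", gap_gt)
print("worst-case gap without tracking:", gap_plain)
```

In this toy setting, with a constant step size, the tracked iterates agree on the minimiser of the average loss, while the untracked iterates settle at device-specific points pulled toward their own local minimisers. SD-GT, as described above, applies tracking terms at both the within-subnet and the between-subnet layer; this sketch covers only a single layer.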
