Delay Sensitive Hierarchical Federated Learning with Stochastic Local Updates
Abdulmoneam Ali, Ahmed Arafa
TL;DR
The paper addresses federated learning in delay-prone networks by proposing a delay-sensitive hierarchical FL (HFL) framework with local parameter servers (LPSs) and a global parameter server (GPS). Local updates are stochastic and occur for a random number of iterations $t_i^u$ within a global sync window $S$, with the overall training time constrained by a system deadline $T$; the GPS aggregates after the maximum group latency. The authors derive a bound on the LPS-GPS divergence (Lemma 1) and establish convergence guarantees for non-convex objectives at both the local-group and global levels, highlighting how the number of groups, group sizes, and $S$ govern performance. They show a sublinear convergence rate $\mathcal{O}(1/\sqrt{\mathcal{U}})$ under reasonable local-time bounds and validate the theory with experiments across datasets, illustrating how to tune $S$ and clustering to mitigate delay effects and improve fairness. The work demonstrates that carefully designed synchronization and grouping can yield substantial gains in delay-constrained FL and provides practical guidance for deploying HFL in 6G-era networks.
Abstract
The impact of local averaging on the performance of federated learning (FL) systems is studied in the presence of communication delay between the clients and the parameter server. To minimize the effect of delay, clients are assigned to different groups, each having its own local parameter server (LPS) that aggregates its clients' models. The groups' models are then aggregated at a global parameter server (GPS) that communicates only with the LPSs. Such a setting is known as hierarchical FL (HFL). Unlike most works in the literature, the number of local and global communication rounds in our work is randomly determined by the (different) delays experienced by each group of clients. Specifically, the number of local averaging rounds is tied to a wall-clock time period termed the sync time $S$, after which the LPSs synchronize their models by sharing them with the GPS. This sync time $S$ is then reapplied until a global wall-clock time is exhausted. First, an upper bound is derived on the deviation between the updated model at each LPS and the model available at the GPS. This bound is then used as a tool in the convergence analysis of our proposed delay-sensitive HFL algorithm, first at each LPS individually, and then at the GPS. Our theoretical convergence bound showcases the effects of all of the system's parameters, including the number of groups, the number of clients per group, and the value of $S$. Our results show that the value of $S$ should be carefully chosen, especially since it implicitly governs how the delay statistics affect the performance of HFL in situations where training time is restricted.
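The timing structure described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm or experimental setup: the function name `simulate_schedule`, the exponential delay model, the per-group mean delays, and the fixed `gps_delay` are all assumptions made for illustration. The sketch only counts how many local averaging rounds each group completes inside one sync window $S$, and how many sync windows fit before the global wall-clock time runs out.

```python
import random


def simulate_schedule(num_groups, sync_time, deadline, mean_delay,
                      gps_delay=0.0, seed=0):
    """Count local averaging rounds per group in each global round.

    Assumed (hypothetical) model: one local round in group g takes an
    exponentially distributed wall-clock time with mean mean_delay[g],
    and each LPS-GPS synchronization costs a fixed gps_delay.
    """
    rng = random.Random(seed)
    elapsed = 0.0
    schedule = []  # one list of per-group round counts per global round

    # Reapply the sync window until the global wall-clock time is exhausted.
    while elapsed + sync_time + gps_delay <= deadline:
        counts = []
        for g in range(num_groups):
            used, rounds = 0.0, 0
            while True:
                delay = rng.expovariate(1.0 / mean_delay[g])
                if used + delay > sync_time:
                    break  # this round would not finish inside the window
                used += delay
                rounds += 1
            counts.append(rounds)
        schedule.append(counts)
        elapsed += sync_time + gps_delay
    return schedule


if __name__ == "__main__":
    # Three groups with different (assumed) mean round delays.
    result = simulate_schedule(3, 5.0, 30.0, [0.5, 1.0, 2.0], gps_delay=1.0)
    for u, counts in enumerate(result):
        print(f"global round {u}: local rounds per group = {counts}")
```

Under this toy model, a larger sync window lets every group complete more local rounds per global round but leaves room for fewer global rounds before the deadline, which is the trade-off that makes the choice of $S$ matter.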
