WHALE-FL: Wireless and Heterogeneity Aware Latency Efficient Federated Learning over Mobile Devices via Adaptive Subnetwork Scheduling

Huai-an Su, Jiaxiang Geng, Liang Li, Xiaoqi Qin, Yanzhao Hou, Hao Wang, Xin Fu, Miao Pan

TL;DR

Federated learning on mobile devices suffers from compute and wireless heterogeneity, which creates latency and stragglers. WHALE-FL introduces adaptive, width-based subnetwork scheduling guided by a joint utility that combines system efficiency and training efficiency using windowed Fisher information, so that each device can size its subnetwork in every round. The approach maps the utility to discrete subnetworks and aggregates the resulting heterogeneous updates. A WHALE-FL prototype shows latency reductions of roughly 1.5x–2.1x across MNIST, CIFAR-10, HAR, and WikiText-2 without sacrificing accuracy, and the paper also analyzes Fisher information dynamics and hyperparameter sensitivity. By accommodating dynamic system and training demands in the scheduling policy, this work enables scalable, fast FL on real-world, heterogeneous mobile devices.
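The mapping from a joint utility to a discrete subnetwork size can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the width levels, the latency-ratio form of the system score, the convex combination with `weight`, the equal-width bucketing, and the use of a plain windowed mean as a stand-in for the windowed Fisher information signal.

```python
from collections import deque

# Hypothetical width ratios, ordered from smallest to largest subnetwork.
WIDTH_LEVELS = [0.125, 0.25, 0.5, 1.0]


class WindowedAverage:
    """Mean of the most recent `window` observations."""

    def __init__(self, window):
        self.values = deque(maxlen=window)

    def update(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)


def system_efficiency(round_latency_s, target_latency_s):
    """Assumed score in (0, 1]: 1.0 if the device meets the latency target."""
    return min(1.0, target_latency_s / round_latency_s)


def joint_utility(sys_eff, train_eff, weight=0.5):
    """Assumed convex combination of two scores that both lie in [0, 1]."""
    return weight * sys_eff + (1.0 - weight) * train_eff


def select_width(utility, levels=WIDTH_LEVELS):
    """Map a utility in [0, 1] to one of the discrete width levels."""
    index = min(int(utility * len(levels)), len(levels) - 1)
    return levels[index]


# Usage: three rounds of a made-up training signal on a device that is
# currently twice as slow as its latency target.
signal = WindowedAverage(window=3)
for proxy in (0.9, 0.6, 0.3):
    train_eff = signal.update(proxy)
sys_eff = system_efficiency(round_latency_s=4.0, target_latency_s=2.0)
width = select_width(joint_utility(sys_eff, train_eff))
```

Because the selection is recomputed each round from fresh measurements, a device whose link or processor slows down would drop to a narrower subnetwork in this sketch, rather than staying at a size fixed at the start of training.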

Abstract

As a popular distributed learning paradigm, federated learning (FL) over mobile devices fosters numerous applications, but its practical deployment is hindered by the participating devices' computing and communication heterogeneity. Some pioneering research efforts proposed to extract subnetworks from the global model and assign as large a subnetwork as possible to each device for local training, based on its full computing and communication capacity. Although such fixed-size subnetwork assignment enables FL training over heterogeneous mobile devices, it is unaware of (i) the dynamic changes in devices' communication and computing conditions and (ii) FL training progress and its dynamic requirements for local training contributions, both of which may cause very long FL training delays. Motivated by those dynamics, in this paper we develop a wireless and heterogeneity aware latency efficient FL (WHALE-FL) approach to accelerate FL training through adaptive subnetwork scheduling. Instead of sticking to a fixed-size subnetwork, WHALE-FL introduces a novel subnetwork selection utility function to capture device and FL training dynamics, and guides each mobile device to adaptively select the subnetwork size for local training based on (a) its computing and communication capacity, (b) its dynamic computing and/or communication conditions, and (c) FL training status and its corresponding requirements for local training contributions. Our evaluation shows that, compared with peer designs, WHALE-FL effectively accelerates FL training without sacrificing learning accuracy.
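To make "extracting subnetworks from the global model" concrete, the following sketch slices a single weight matrix by width and then merges updates of different sizes. It is an illustration under stated assumptions, not the paper's procedure: keeping the leading rows and columns, and averaging each entry over the devices that cover it, are one common way to handle width-scaled models, and the function names are hypothetical.

```python
import math


def extract_subnetwork(weights, ratio):
    """Keep the leading ceil(ratio * dim) rows and columns of a weight matrix."""
    rows = math.ceil(ratio * len(weights))
    cols = math.ceil(ratio * len(weights[0]))
    return [row[:cols] for row in weights[:rows]]


def aggregate(global_weights, updates):
    """Average each entry over the devices whose subnetwork contains it.

    Entries that no device trained keep their previous global value.
    """
    result = [row[:] for row in global_weights]
    for i, row in enumerate(global_weights):
        for j in range(len(row)):
            covering = [u[i][j] for u in updates
                        if i < len(u) and j < len(u[i])]
            if covering:
                result[i][j] = sum(covering) / len(covering)
    return result


# Usage with made-up values: one half-width and one full-width device.
global_w = [[0.0] * 4 for _ in range(4)]
half = extract_subnetwork(global_w, 0.5)       # 2 x 2 slice sent to a device
small_update = [[1.0] * 2 for _ in range(2)]   # pretend result of local training
full_update = [[3.0] * 4 for _ in range(4)]
new_w = aggregate(global_w, [small_update, full_update])
```

In this sketch the top-left block is averaged over both devices, while the remaining entries come only from the full-width device.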

Paper Structure

This paper contains 22 sections, 1 theorem, 30 equations, 10 figures, and 3 tables.

Key Result

Theorem 1

Let all assumptions hold. Suppose that the step size $\gamma$ satisfies $0 \leq \gamma \leq \min\left\{ \frac{1}{12TL}, \frac{|\mathcal{M}^*|}{16TL\sqrt{N}}, \left(\frac{|\mathcal{M}^*|}{768T^3L^3N}\right)^{\frac{1}{3}}\right\}$. Then, for all $Q\geq 1$, we have:
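The step-size condition in Theorem 1 is a minimum of three terms and can be evaluated numerically for given constants. The sketch below only transcribes that expression. The symbols are not defined in this summary, so the function treats $T$, $L$, $N$, and $|\mathcal{M}^*|$ as positive numbers supplied by the caller, and the function and argument names are hypothetical.

```python
import math


def step_size_upper_bound(T, L, N, m_star):
    """Largest step size gamma allowed by the condition in Theorem 1.

    m_star stands for |M*|; all four inputs are assumed to be positive.
    """
    term1 = 1.0 / (12 * T * L)
    term2 = m_star / (16 * T * L * math.sqrt(N))
    term3 = (m_star / (768 * T**3 * L**3 * N)) ** (1.0 / 3.0)
    return min(term1, term2, term3)


# Usage with made-up constants.
bound = step_size_upper_bound(T=5, L=2.0, N=100, m_star=10)
```

Which of the three terms is the binding one depends on the constants, so it is worth checking all three rather than assuming the first always dominates.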

Figures (10)

  • Figure 1: Performance comparison of different FL training approaches under various learning tasks. Figures from left to right are CNN@MNIST, ResNet-18@CIFAR-10, Transformer@WikiText-2, and CNN@HAR with non-IID datasets.
  • Figure 2: Fisher information and subnetwork size level changes over training time (CNN@MNIST). From left to right, the performance of the user-side models on MacBook Pro 2018, NVIDIA Jetson TX2, and Raspberry Pi 4, as well as the global model's performance, are shown.
  • Figure 3: Performance comparison of WHALE-FL, system efficiency only and training efficiency only designs (ResNet-18@CIFAR-10).
  • Figure : (a) $\beta$, CNN@MNIST.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Theorem 1