A Robust Federated Learning Framework for Undependable Devices at Scale
Shilong Wang, Jianchun Liu, Hongli Xu, Chunming Qiao, Huarong Deng, Qiuye Zheng, Jiantao Gong
TL;DR
The paper tackles device undependability in federated learning by introducing FLUDE, a framework that combines adaptive device selection, local model caching, and staleness-aware model distribution to sustain model quality while reducing wasted resources. It estimates device dependability with Beta-distributed priors and online updates, uses a rolling local cache to preserve training progress, and distributes the latest global model only to suitably fresh or constrained devices based on a dynamic staleness threshold expressed by $W_{new}$. The approach is validated on two physical platforms with 40 OPPO smartphones and 80 NVIDIA Jetson devices, showing gains in final accuracy ($2.28\%$–$7.43\%$), speedups in time-to-accuracy ($1.2\times$–$3.2\times$), and reductions in communication cost ($23.71\%$–$40.71\%$) across CIFAR-10/100, Google Speech, and Avazu tasks. Overall, FLUDE demonstrates stronger robustness and efficiency in realistic, large-scale FL settings, with potential extensions to semi-supervised scenarios.
Abstract
In a federated learning (FL) system, many devices, such as smartphones, are often undependable (e.g., frequently disconnected from WiFi) during training. Existing FL frameworks always assume a dependable environment and exclude undependable devices from training, leading to poor model performance and resource wastage. In this paper, we propose FLUDE to effectively deal with undependable environments. First, FLUDE assesses the dependability of devices based on the probability distribution of their historical behaviors (e.g., the likelihood of successfully completing training). Based on this assessment, FLUDE adaptively selects devices with high dependability for training. To mitigate resource wastage during the training phase, FLUDE maintains a model cache on each device, which preserves the latest training state for later use in case local training on an undependable device is interrupted. Moreover, FLUDE adopts a staleness-aware strategy to judiciously distribute the global model to a subset of devices, thus significantly reducing resource wastage while maintaining model performance. We have implemented FLUDE on two physical platforms with a total of 120 devices (smartphones and NVIDIA Jetson devices). Extensive experimental results demonstrate that FLUDE can effectively improve model performance and resource efficiency of FL training in undependable environments.
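As a rough illustration of how a dependability estimate based on historical behavior could work, the sketch below keeps a Beta posterior per device over the probability of completing a training round and updates it after each observed outcome. It is not the authors' implementation: the uniform Beta(1, 1) prior, the names `DependabilityEstimate` and `select_devices`, and the top-k selection by posterior mean are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DependabilityEstimate:
    """Beta posterior over a device's probability of finishing a round.

    Hypothetical helper: the Beta(1, 1) starting point is an assumed
    uniform prior, not a value taken from the paper.
    """
    alpha: float = 1.0  # pseudo-count of completed rounds
    beta: float = 1.0   # pseudo-count of interrupted rounds

    def update(self, completed: bool) -> None:
        # Conjugate Beta-Bernoulli update: add one count to the matching side.
        if completed:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        # Posterior mean of the completion probability.
        return self.alpha / (self.alpha + self.beta)


def select_devices(estimates: Dict[str, DependabilityEstimate], k: int) -> List[str]:
    """Return the k device ids with the highest posterior mean (assumed rule)."""
    ranked = sorted(estimates, key=lambda d: estimates[d].mean, reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    devices = {name: DependabilityEstimate() for name in ("phone-1", "phone-2", "jetson-1")}
    for outcome in (True, True, True):
        devices["phone-1"].update(outcome)   # three completed rounds
    devices["phone-2"].update(False)         # one interrupted round
    print(select_devices(devices, k=2))
```

Ranking purely by posterior mean ignores uncertainty about rarely observed devices; a real selector would likely also need some form of exploration, which this sketch leaves out.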
