A Robust Federated Learning Framework for Undependable Devices at Scale

Shilong Wang; Jianchun Liu; Hongli Xu; Chunming Qiao; Huarong Deng; Qiuye Zheng; Jiantao Gong

TL;DR

The paper tackles device undependability in federated learning by introducing FLUDE, a framework that combines adaptive device selection, local model caching, and staleness-aware model distribution to sustain model quality while reducing wasted resources. FLUDE estimates device dependability with Beta-distributed priors and online updates, uses a rolling local cache to preserve training progress, and distributes the latest global model only to suitably fresh or constrained devices based on a dynamic staleness threshold expressed by $W_{new}$. The approach is validated on two physical platforms with 40 OPPO smartphones and 80 NVIDIA Jetson devices, showing improvements in final accuracy ($2.28\%$–$7.43\%$), time-to-accuracy ($1.2\times$–$3.2\times$), and communication costs ($23.71\%$–$40.71\%$) across CIFAR-10/100, Google Speech, and Avazu tasks. Overall, FLUDE demonstrates stronger robustness and efficiency in realistic, large-scale FL settings, with potential extensions to semi-supervised scenarios.
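The dependability estimate mentioned above can be illustrated with a standard Beta-Bernoulli update. This is a minimal sketch, not FLUDE's actual estimator: the uniform Beta(1, 1) prior, the class and function names, the device ids, and the top-k selection rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DependabilityEstimate:
    """Beta(alpha, beta) belief that a device completes a training round."""
    alpha: float = 1.0  # assumed uniform prior: one pseudo-success
    beta: float = 1.0   # assumed uniform prior: one pseudo-failure

    def update(self, completed: bool) -> None:
        # Conjugate Beta-Bernoulli update: count one more success or failure.
        if completed:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        # Posterior mean of the completion probability.
        return self.alpha / (self.alpha + self.beta)


def select_devices(estimates: dict[str, DependabilityEstimate], k: int) -> list[str]:
    """Return the k device ids with the highest posterior mean (hypothetical rule)."""
    ranked = sorted(estimates, key=lambda d: estimates[d].mean, reverse=True)
    return ranked[:k]


# Hypothetical per-device history: True means the round was completed.
history = {
    "phone-a": [True, True, True, False],
    "phone-b": [False, False, True],
    "jetson-c": [True, True],
}
estimates = {}
for device, outcomes in history.items():
    est = DependabilityEstimate()
    for completed in outcomes:
        est.update(completed)
    estimates[device] = est

selected = select_devices(estimates, k=2)
```

A greedy top-k rule like this one would starve devices with few observations; a real selector would likely also account for uncertainty and participation fairness, which this sketch does not model.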

Abstract

In a federated learning (FL) system, many devices, such as smartphones, are often undependable (e.g., frequently disconnected from WiFi) during training. Existing FL frameworks always assume a dependable environment and exclude undependable devices from training, leading to poor model performance and resource wastage. In this paper, we propose FLUDE to effectively deal with undependable environments. First, FLUDE assesses the dependability of devices based on the probability distribution of their historical behaviors (e.g., the likelihood of successfully completing training). Based on this assessment, FLUDE adaptively selects devices with high dependability for training. To mitigate resource wastage during the training phase, FLUDE maintains a model cache on each device, aiming to preserve the latest training state for later use in case local training on an undependable device is interrupted. Moreover, FLUDE proposes a staleness-aware strategy to judiciously distribute the global model to a subset of devices, thus significantly reducing resource wastage while maintaining model performance. We have implemented FLUDE on two physical platforms with 120 smartphones and NVIDIA Jetson devices. Extensive experimental results demonstrate that FLUDE can effectively improve model performance and resource efficiency of FL training in undependable environments.
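The local model cache described in the abstract can be pictured as a checkpoint that survives an interrupted training run. The sketch below is a toy illustration under stated assumptions: the single-slot in-memory cache, the `Checkpoint` fields, the `train` function, and the placeholder "gradient step" are all hypothetical and do not reproduce FLUDE's implementation. In particular, it always resumes from the cache, whereas FLUDE decides via its staleness-aware strategy whether a device should receive the new global model instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checkpoint:
    base_round: int       # global round the local model was derived from
    step: int             # local steps completed so far
    weights: list[float]  # stand-in for real model parameters


@dataclass
class LocalModelCache:
    """Keeps only the most recent checkpoint (hypothetical single-slot cache)."""
    latest: Optional[Checkpoint] = None

    def save(self, ckpt: Checkpoint) -> None:
        # Copy the weights so later training steps cannot mutate the cache.
        self.latest = Checkpoint(ckpt.base_round, ckpt.step, list(ckpt.weights))


def train(cache, global_round, global_weights, total_steps, fail_at=None):
    """Run toy local training, resuming from the cache when one exists.

    Returns the final weights, or None if training was interrupted.
    """
    if cache.latest is not None:
        start, weights = cache.latest.step, list(cache.latest.weights)
        base = cache.latest.base_round
    else:
        start, weights, base = 0, list(global_weights), global_round
    for step in range(start, total_steps):
        if fail_at is not None and step == fail_at:
            return None  # simulated disconnect; cached progress survives
        weights = [w - 0.1 for w in weights]  # placeholder "gradient step"
        cache.save(Checkpoint(base, step + 1, weights))
    return weights


cache = LocalModelCache()
# First attempt is interrupted after two of four local steps.
first = train(cache, global_round=5, global_weights=[1.0, 2.0], total_steps=4, fail_at=2)
# Second attempt resumes from the cached state instead of restarting.
second = train(cache, global_round=6, global_weights=[9.0, 9.0], total_steps=4)
```

In this run the second call performs only the two remaining steps, and the cached `base_round` still records round 5, which is the kind of information a staleness check would need.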

Paper Structure (20 sections, 4 equations, 9 figures, 2 tables, 2 algorithms)

Introduction
Background and Motivation
Key Observations in FL Systems
Limitations of Existing FL Systems
Limitations in Model Performance
Limitations in Resource Efficiency
Overview of FLUDE
Detail Design of FLUDE
Adaptive Device Selection
Local Model Caching
Staleness-Aware Model Distribution
Integrating Modules into FLUDE
Performance Evaluation
System Implementation
Experimental Settings
...and 5 more sections

Figures (9)

Figure 1: Training performance comparison under dependable and undependable environments. (a) The global model accuracy. (b) The model accuracy (bars) and volumes of data involved in federated training (lines) across data classes. (c) The model accuracy (bars) and participation frequency (lines) across devices.
Figure 2: Communication costs to reach the target accuracy of 45% for training CNN on CIFAR-10.
Figure 3: Overview and workflow of FLUDE.
Figure 4: Performance comparison of time-to-accuracy between FLUDE and the baselines.
Figure 5: Comparison of communication costs between FLUDE and the baselines.
...and 4 more figures
