Efficient Data Distribution Estimation for Accelerated Federated Learning

Yuanli Wang, Lei Huang

TL;DR

This work tackles the overhead of device selection in large-scale Federated Learning (FL) under non-IID data and heterogeneous devices. It combines encoder-based dimension reduction with a coreset to produce compact distribution summaries of size $C*H+C$, then clusters devices with K-means to guide client selection. Experiments on FEMNIST and OpenImage show up to a 30x reduction in data-summary time and up to a 360x reduction in clustering time, which makes adaptive FL more scalable. The method can complement privacy-preserving techniques and offers a practical path toward robust, real-world FL deployment with dynamic resource availability.

Abstract

Federated Learning (FL) is a privacy-preserving machine learning paradigm in which a global model is trained in situ across a large number of distributed edge devices. These systems often comprise millions of user devices, and only a subset of the available devices can be used for training in each epoch. Designing a device selection strategy is challenging because devices are highly heterogeneous in both their system resources and their training data. This heterogeneity makes device selection crucial for timely model convergence and sufficient model accuracy. To tackle the FL client heterogeneity problem, various client selection algorithms have been developed, showing promising performance improvements in terms of model coverage and accuracy. In this work, we study the overhead of client selection algorithms in a large-scale FL environment. We then propose an efficient algorithm for computing data distribution summaries that reduces this overhead in a real-world, large-scale FL environment. The evaluation shows that our proposed solution achieves up to a 30x reduction in data summary time and up to a 360x reduction in clustering time.
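
The summarize-then-cluster pipeline described in the TL;DR can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that $C$ is the number of classes and $H$ the encoded feature dimension, and that a summary concatenates $C$ per-class mean feature vectors with $C$ label fractions; the text gives only the total size $C*H+C$. The names `encode`, `device_summary`, `kmeans`, `proj`, and `coreset_size` are hypothetical, the encoder is replaced by a fixed linear projection, and uniform subsampling stands in for the coreset.

```python
import random

def encode(x, proj):
    # Hypothetical stand-in for a trained encoder: a fixed linear map to H dims.
    return [sum(w * v for w, v in zip(row, x)) for row in proj]

def device_summary(samples, labels, proj, num_classes, coreset_size, seed=0):
    """Summarize one device as a vector of length C*H + C.

    Assumed layout (not confirmed by the text): C per-class mean feature
    vectors of H dimensions each, followed by C label fractions.
    """
    if not samples:
        raise ValueError("device has no samples")
    rng = random.Random(seed)
    # Uniform subsampling is used here as the simplest possible coreset.
    keep = rng.sample(range(len(samples)), min(coreset_size, len(samples)))
    h = len(proj)
    sums = [[0.0] * h for _ in range(num_classes)]
    counts = [0] * num_classes
    for i in keep:
        z = encode(samples[i], proj)
        counts[labels[i]] += 1
        for j in range(h):
            sums[labels[i]][j] += z[j]
    summary = []
    for c in range(num_classes):
        # Classes absent from the coreset contribute an all-zero mean.
        summary.extend(s / counts[c] if counts[c] else 0.0 for s in sums[c])
    summary.extend(n / len(keep) for n in counts)
    return summary

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    # Deterministic farthest-point initialization, then Lloyd iterations.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Toy run: C = 2 classes, H = 2 encoded dimensions, 3 raw features per sample.
proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
rng = random.Random(42)

def make_device(label, n=50):
    # Each toy device holds a single class; real non-IID devices would be mixed.
    base = [1.0, 0.0, 0.0] if label == 0 else [0.0, 1.0, 0.0]
    xs = [[v + rng.uniform(-0.05, 0.05) for v in base] for _ in range(n)]
    return xs, [label] * n

devices = [make_device(0) for _ in range(3)] + [make_device(1) for _ in range(3)]
summaries = [
    device_summary(xs, ys, proj, num_classes=2, coreset_size=20)
    for xs, ys in devices
]
assignments = kmeans(summaries, k=2)
print(len(summaries[0]), assignments)
```

Because each summary has a fixed length that does not depend on how many samples a device holds, the clustering cost grows with the number of devices rather than with the total amount of data. A selection policy could then draw clients from each cluster; that step is left out here.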

Paper Structure

This paper contains 9 sections, 2 equations, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: System overview and workflow of query deployment.