Pollen: High-throughput Federated Learning Simulation via Resource-Aware Client Placement

Lorenzo Sani, Pedro Porto Buarque de Gusmão, Alex Iacob, Wanru Zhao, Xinchi Qiu, Yan Gao, Javier Fernandez-Marques, Nicholas Donald Lane

TL;DR

Pollen tackles the bottlenecks of large-scale federated learning (FL) simulation with three mechanisms: push-based client placement, a concurrency-aware scheduling model, and scalable partial aggregation, which together reduce communication overhead and GPU idle time. Its learning-based placement predicts per-client training times using a log-linear model with adaptive correction, so that workloads can be balanced across heterogeneous GPUs. Across four FL tasks and multi-node hardware setups, Pollen runs faster than existing simulators, including pfl, making realistic production-scale experiments feasible within weeks rather than months. The framework is intended to accelerate FL research and prototyping on diverse hardware.

Abstract

Federated Learning (FL) is a privacy-focused machine learning paradigm that collaboratively trains models directly on edge devices. Simulation plays an essential role in FL adoption, helping develop novel aggregation and client sampling strategies. However, current simulators cannot emulate large-scale systems in a time-efficient manner, which limits their utility and casts doubt on the generalizability of their results. This work proposes Pollen, a novel resource-aware system for speeding up simulations. Pollen addresses two limiting factors of existing simulators: (a) communication inefficiency derived from pull-based client execution and (b) inadequate load balancing when using heterogeneous hardware. Pollen executes high-throughput FL simulations at scale by (a) using a push-based client placement system, (b) learning an adaptable scheduling of clients based on hardware statistics, and (c) estimating the optimal number of concurrent workers per GPU. We evaluate Pollen on four representative FL tasks and show that Pollen's placement model increases GPU utilization and reduces idle time. We compare Pollen to Flower, Flute, FedScale, Parrot, and pfl and show experimental speed-ups of days or weeks.
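The contrast between pull-based execution and push-based placement can be illustrated with a short sketch. In a pull-based design, workers repeatedly ask a shared queue for the next client; in a push-based design, the server decides the whole assignment up front and sends each worker its list. The code below is an illustrative greedy heuristic (longest predicted time first, each client going to the worker that would finish it earliest), not Pollen's actual placement algorithm. The function name `place_clients` and the notion of a relative `worker_speeds` factor are assumptions introduced for this example.

```python
def place_clients(predicted_times, worker_speeds):
    """Greedy push-based placement sketch.

    predicted_times: {client_id: predicted seconds on a reference GPU}
    worker_speeds:   {worker_id: relative speed, 1.0 = reference GPU}
    Returns (assignment, loads), where loads holds each worker's
    predicted total busy time in seconds.
    """
    loads = {w: 0.0 for w in worker_speeds}
    assignment = {w: [] for w in worker_speeds}

    # Place the longest clients first so the short ones can fill the gaps.
    ordered = sorted(predicted_times.items(), key=lambda kv: kv[1], reverse=True)
    for client_id, seconds in ordered:
        # Choose the worker that would finish this client earliest.
        best = min(loads, key=lambda w: loads[w] + seconds / worker_speeds[w])
        assignment[best].append(client_id)
        loads[best] += seconds / worker_speeds[best]

    return assignment, loads
```

Because the full assignment is computed before the round starts, each worker receives a single message containing its client list, and the gap between the largest and smallest value in `loads` gives a rough estimate of the idle time the faster-finishing workers would incur.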

Paper Structure (36 sections, 3 equations, 22 figures, 9 tables)

This paper contains 36 sections, 3 equations, 22 figures, 9 tables.

Figures (22)

  • Figure 1: For large-scale experiments of $5$ million clients trained across 5000 rounds, Pollen makes previously infeasible experiments possible and outperforms all other frameworks. For example, on the Image Classification task Pollen executes in less than one week, outmatching the two-week training time of its competitors. \ref{sec:experimental_design} describes the design of this experiment. We also provide a comparison against the recently released pfl \cite{pfl_paper} framework in \ref{sec:pfl_comparison}.
  • Figure 2: Dataset size distribution over clients for OpenImage, Google Speech, Shakespeare, and Reddit. The x-axis is non-linear, and the Reddit dataset is subsampled for comparison. Real-world federated datasets are naturally unbalanced, representing a challenge when optimizing resource utilization for machine learning algorithms.
  • Figure 3: GPU utilization on two A40 GPUs training clients with the same number of samples (left) versus clients with different numbers of samples following a naturally partitioned dataset, i.e. OpenImage (right). The ideal scenario (left) yields balanced GPU idle times of $12.3$ and $12.6$ seconds, whereas the naturally partitioned one yields $16.5$ and $40.5$ seconds. The comparison showcases the detrimental impact that unbalanced real-world datasets, such as those in \ref{fig:client_heterogeneity}, may have on GPU utilization.
  • Figure 4: The number of worker processes has an optimum that depends on the available hardware and the task-specific workload. In this experiment, the same FL simulation is run with different numbers of worker processes. While the single-client training time increases proportionally to the number of workers, the total runtime of the experiment decreases until it reaches the optimum at 5 workers.
  • Figure 5: Training times for two GPUs, Nvidia A40 and Nvidia RTX 2080, running on clients with different dataset sizes. Since the GPUs run the same clients, their diversity is reflected by the different distribution of training times with the same number of batches trained, the different trends they produce as the number of batches increases, and the pattern of the fluctuations when the training time starts to plateau.
  • ...and 17 more figures