
BlazeFL: Fast and Deterministic Federated Learning Simulation

Kitsuya Azuma, Takayuki Nishio

Abstract

Federated learning (FL) research increasingly relies on single-node simulations with hundreds or thousands of virtual clients, making both efficiency and reproducibility essential. Yet parallel client training often introduces nondeterminism through shared random state and scheduling variability, forcing researchers to trade throughput for reproducibility or to implement custom control logic within complex frameworks. We present BlazeFL, a lightweight framework for single-node FL simulation that alleviates this trade-off through free-threaded shared-memory execution and deterministic randomness management. BlazeFL uses thread-based parallelism with in-memory parameter exchange between the server and clients, avoiding serialization and inter-process communication overhead. To support deterministic execution, BlazeFL assigns isolated random number generator (RNG) streams to clients. Under a fixed software/hardware stack, and when stochastic operators consume BlazeFL-managed generators, this design yields bitwise-identical results across repeated high-concurrency runs in both thread-based and process-based modes. In CIFAR-10 image-classification experiments, BlazeFL substantially reduces execution time relative to a widely used open-source baseline, achieving up to 3.1$\times$ speedup on communication-dominated workloads while preserving a lightweight dependency footprint. Our open-source implementation is available at: https://github.com/kitsuyaazuma/blazefl.
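The abstract's core mechanism, assigning each client an isolated RNG stream so that results do not depend on thread scheduling, can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not BlazeFL's actual API; the seeding scheme (`BASE_SEED + client_id`) and function names are assumptions for the sketch.

```python
import random
from concurrent.futures import ThreadPoolExecutor

BASE_SEED = 42


def train_client(client_id: int) -> float:
    # Isolated per-client RNG stream: seeded only by the base seed and the
    # client ID, never shared across threads, so the draw below is the same
    # no matter which worker thread runs this client or in what order.
    rng = random.Random(BASE_SEED + client_id)
    # Stand-in for stochastic local training (e.g., data shuffling, dropout).
    return rng.random()


def run_round(num_clients: int, max_workers: int) -> list[float]:
    # Thread-based parallelism: results are gathered in client-ID order,
    # so the output is independent of scheduling.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(train_client, range(num_clients)))


# Repeated runs at different parallelism levels are bitwise-identical.
assert run_round(8, max_workers=2) == run_round(8, max_workers=8)
```

In a real FL simulator the per-client generator would also be handed to the framework's stochastic operators (data loaders, augmentation, dropout); the abstract notes that bitwise reproducibility holds only when those operators consume the managed generators.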

Paper Structure

This paper contains 24 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Architecture overview of BlazeFL. A main thread coordinates client scheduling, while worker threads execute within a shared address space, enabling server-to-client parameter broadcast and client-to-server uploads without cross-process serialization or IPC. Each client is associated with an isolated RNG stream to support deterministic repeated execution under controlled settings.
  • Figure 2: Wall-clock time for five communication rounds on the high-performance server (48 CPU cores, NVIDIA H100) as a function of client parallelism $P$. Timings include client training, server aggregation, and global evaluation, but exclude dataset download and partition generation. Comparison of BlazeFL (free-threaded), BlazeFL (process-based shared memory), and Flower (Ray backend). Lower is better. Missing points for BlazeFL (process-based) indicate execution failure due to CUDA out-of-memory errors.
  • Figure 3: Wall-clock time for five communication rounds on the workstation-class server (32 CPU cores, NVIDIA Quadro RTX 6000) as a function of client parallelism $P$. Timings include client training, server aggregation, and global evaluation, but exclude dataset download and partition generation. Comparison of BlazeFL (free-threaded), BlazeFL (process-based shared memory), and Flower (Ray backend). Lower is better.
  • Figure 4: Accumulation of non-deterministic errors in Flower across 10 runs with manual global seeding. The $y$-axis represents the $L_2$ distance between each run's client logits and the mean logits of all runs at the start of each communication round. The trajectories fan out as floating-point rounding differences from non-deterministic aggregation order compound over time.
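The divergence metric plotted in Figure 4 (the $L_2$ distance between each run's client logits and the mean logits over all runs, measured at the start of a round) can be computed as below. This is a sketch under the assumption that logits are available as flat lists of floats; the helper name `l2_divergence` is illustrative, not an API of Flower or BlazeFL.

```python
import math


def l2_divergence(run_logits: list[list[float]]) -> list[float]:
    """Per-run L2 distance from the mean logits across all runs.

    run_logits[r][i] is the i-th logit of run r at the start of a round.
    Returns one distance per run; deterministic runs all yield 0.0.
    """
    n = len(run_logits)
    dim = len(run_logits[0])
    # Element-wise mean of the logits over all runs.
    mean = [sum(run[i] for run in run_logits) / n for i in range(dim)]
    return [
        math.sqrt(sum((run[i] - mean[i]) ** 2 for i in range(dim)))
        for run in run_logits
    ]
```

Under fully deterministic execution every run coincides with the mean, so all distances are zero; the fanning-out trajectories in Figure 4 correspond to these distances growing round over round as floating-point rounding differences from non-deterministic aggregation order compound.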