pfl-research: simulation framework for accelerating research in Private Federated Learning

Filip Granqvist; Congzheng Song; Áine Cahill; Rogier van Dalen; Martin Pelikan; Yi Sheng Chan; Xiaojun Feng; Natarajan Krishnaswami; Vojta Jina; Mona Chitnis

TL;DR

pfl-research addresses the need for fast, scalable simulation of private federated learning (PFL) by decoupling computation from real-world topology and enabling distributed experiments with GPU-accelerated differential privacy (DP) postprocessing. The framework is modular and Python-based, integrating PyTorch, TensorFlow, Horovod, and pluggable privacy accounting to support a wide range of PFL scenarios. Empirical results show speedups of $7\times$ to $72\times$ over existing simulators and demonstrate scalable distributed training across diverse benchmarks, including CIFAR10, StackOverflow, FLAIR, and LLM tasks. This work lowers resource barriers for FL/PFL researchers and lays the groundwork for community-driven extensions and richer cross-framework benchmarks.

Abstract

Federated learning (FL) is an emerging machine learning (ML) training paradigm where clients own their data and collaborate to train a global model, without revealing any data to the server or other participants. Researchers commonly perform experiments in a simulation environment to quickly iterate on ideas. However, existing open-source tools do not offer the efficiency required to simulate FL on larger and more realistic FL datasets. We introduce pfl-research, a fast, modular, and easy-to-use Python framework for simulating FL. It supports TensorFlow, PyTorch, and non-neural network models, and is tightly integrated with state-of-the-art privacy algorithms. We study the speed of open-source FL frameworks and show that pfl-research is 7-72$\times$ faster than alternative open-source frameworks on common cross-device setups. Such a speedup will significantly boost the productivity of the FL research community and enable testing hypotheses on realistic FL datasets that were previously too resource-intensive. We release a suite of benchmarks that evaluates an algorithm's overall performance on a diverse set of realistic scenarios. The code is available on GitHub at https://github.com/apple/pfl-research.
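To make the simulated setting concrete, the following is a minimal sketch of one round of federated averaging with central differential privacy, written in plain NumPy. It does not use the pfl-research API: the function names (`local_update`, `clip_l2`, `simulate_round`), the least-squares model, and the parameters `clip_bound` and `noise_stddev` are all illustrative assumptions, and privacy accounting is omitted.

```python
import numpy as np


def local_update(weights, features, targets, learning_rate=0.1):
    """Take one gradient step of least-squares regression on one user's data.

    Returns the model delta (new weights minus old weights), which is what
    the simulated user reports to the server.
    """
    residual = features @ weights - targets
    gradient = 2.0 * features.T @ residual / len(targets)
    return -learning_rate * gradient


def clip_l2(delta, clip_bound):
    """Scale delta down so that its L2 norm is at most clip_bound."""
    norm = np.linalg.norm(delta)
    return delta * min(1.0, clip_bound / max(norm, 1e-12))


def simulate_round(weights, cohort, clip_bound, noise_stddev, rng):
    """One round: clip each user's delta, sum, add Gaussian noise, average.

    `cohort` is a list of (features, targets) pairs, one per sampled user.
    `noise_stddev` is the standard deviation of the noise added to the sum
    (hypothetical parameter; calibrating it to a privacy budget is not shown).
    """
    total = np.zeros_like(weights)
    for features, targets in cohort:
        total += clip_l2(local_update(weights, features, targets), clip_bound)
    total += rng.normal(0.0, noise_stddev, size=total.shape)
    return weights + total / len(cohort)
```

In this sketch the users are processed sequentially in one process. Since each user's local update depends only on the current global weights, the loop over the cohort is the part that a simulator can spread across workers.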

Paper Structure (37 sections, 3 equations, 10 figures, 13 tables)

Introduction
Related Work
System design
Distributed simulations
Experiments
Comparing performance to alternative FL simulators
Scaling distributed training
Benchmarks for research
LLM benchmarks
Future work
Conclusion
Preliminaries
Federated Learning.
Differential Privacy.
Detailed system design
...and 22 more sections

Figures (10)

Figure 1: (a) The (simplified) architecture of distributed simulations. Each process is a replica with a different distributed context. One synchronous communication step is done to aggregate gradients and metrics. (b) Each process has a balanced queue of users to train.
Figure 2: Speedup from scaling up the number of processes per GPU in distributed simulations, while keeping the hardware resources fixed. As long as the number of GPUs $\ll$ cohort size (unlike the blue line, where we use an unnecessarily large number of GPUs given the size of the model, users, and cohort), the wall-clock time decreases monotonically as the number of models trained in parallel on the same GPU increases.
Figure 3: Speedup from scaling up distributed simulations. Left panel: Sweep the number of processes per GPU on the CIFAR10, StackOverflow, and FLAIR benchmarks while keeping the hardware resources fixed. Right panel: Sweep the number of GPUs used to train the StackOverflow benchmark, repeated for 1, 3, and 5 processes per GPU. Blue lines (left y-axis) show wall-clock time. Green lines (right y-axis) show total GPU hours for the same experiments. Note that the runs with >8 GPUs use multiple hosts and thus become slightly less efficient.
Figure 4: Generalized federated learning simulation.
Figure 5: FedAvg using pfl-research interfaces
...and 5 more figures
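
The caption of Figure 1(b) mentions that each process holds a balanced queue of users. The excerpt here does not state how that balance is achieved, so the following is only a sketch of one common approach: greedy scheduling that assigns the heaviest remaining user to the currently least-loaded process. The function name `balance_users` and the use of a per-user weight (for example, the number of data points) are assumptions made for illustration.

```python
import heapq


def balance_users(user_weights, num_processes):
    """Greedily assign users to processes so that total weight is balanced.

    `user_weights` maps a user id to an estimated training cost (assumed
    here to be something like the user's number of data points).
    Returns one list of user ids per process.
    """
    # Min-heap of (current load, process index).
    heap = [(0.0, process) for process in range(num_processes)]
    heapq.heapify(heap)
    queues = [[] for _ in range(num_processes)]

    # Heaviest users first, each going to the least-loaded process so far.
    ordered = sorted(user_weights.items(), key=lambda item: item[1], reverse=True)
    for user_id, weight in ordered:
        load, process = heapq.heappop(heap)
        queues[process].append(user_id)
        heapq.heappush(heap, (load + weight, process))
    return queues
```

With balanced queues, the processes reach the synchronous aggregation step at roughly the same time, which limits idle waiting at that step.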

Theorems & Definitions (1)

Definition A.1: Differential privacy
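
The text of Definition A.1 is not reproduced on this page. For orientation, the standard $(\epsilon, \delta)$ formulation of differential privacy is given below; the symbols $\mathcal{M}$, $D$, $D'$, and $S$ are the conventional ones and may differ from the paper's notation.

```latex
A randomized mechanism $\mathcal{M}$ satisfies $(\epsilon, \delta)$-differential
privacy if, for every pair of neighboring datasets $D$ and $D'$ and every
measurable set of outputs $S$,
\[
  \Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta .
\]
```

In federated settings, "neighboring" is commonly taken at the user level, meaning $D$ and $D'$ differ in the data of a single user. Whether the paper uses exactly this notion is an assumption here.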
