FedPS: Federated data Preprocessing via aggregated Statistics
Xuefeng Xu, Graham Cormode
TL;DR
FedPS addresses the neglected bottleneck of data preprocessing in federated learning by introducing a unified framework that leverages aggregated statistics and data sketches to produce consistent, privacy-preserving preprocessing across clients. It extends standard preprocessing techniques to distributed settings and analyzes their communication costs, while also developing federated Bayesian linear regression for both horizontal and vertical FL to enable model-based imputation and transformation. Empirical results on public tabular datasets show that federated preprocessing substantially improves accuracy over raw or locally preprocessed data, particularly under non-IID conditions, with predictable communication overhead. By bridging preprocessing with federated modeling primitives, FedPS enables practical, scalable, and robust FL deployments.
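As an illustration of how a Bayesian linear regression can be fit from aggregated statistics in the horizontal setting, the following minimal sketch has each client share only its Gram matrix and feature-target product, which the server sums to form the posterior. It assumes a zero-mean Gaussian prior with fixed precision `alpha` and a fixed noise precision `beta`; the function names and these fixed hyperparameters are illustrative assumptions, not the FedPS algorithm itself.

```python
import numpy as np

def client_summary(X, y):
    """Client side: return the sufficient statistics X^T X and X^T y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return X.T @ X, X.T @ y

def posterior(summaries, alpha=1.0, beta=1.0):
    """Server side: sum client statistics and compute the weight posterior.

    Assumes prior w ~ N(0, alpha^{-1} I) and noise precision beta
    (both fixed here for simplicity; a hypothetical simplification).
    """
    gram = sum(s[0] for s in summaries)   # sum of X_k^T X_k over clients
    xty = sum(s[1] for s in summaries)    # sum of X_k^T y_k over clients
    d = gram.shape[0]
    precision = alpha * np.eye(d) + beta * gram
    cov = np.linalg.inv(precision)        # posterior covariance
    mean = beta * cov @ xty               # posterior mean
    return mean, cov
```

Because the Gram matrix and feature-target product are additive over rows, the result matches what a single party would obtain from the pooled data, while each client communicates only a d-by-d matrix and a length-d vector.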
Abstract
Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, and the need for communication efficiency further constrains how preprocessing can be distributed. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.
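To make the idea of preprocessing from aggregated statistics concrete, here is a minimal sketch of federated standard scaling in the horizontal setting. Each client reports only its row count, per-feature sum, and per-feature sum of squares; the server combines them into a global mean and standard deviation that every client then applies. The function names are hypothetical, and the sum-of-squares formula is one simple choice rather than the specific method used by FedPS.

```python
import numpy as np

def local_stats(X):
    """Client side: row count, per-feature sum, per-feature sum of squares."""
    X = np.asarray(X, dtype=float)
    return X.shape[0], X.sum(axis=0), (X ** 2).sum(axis=0)

def aggregate(stats):
    """Server side: combine client summaries into a global mean and std."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    # Population variance E[x^2] - E[x]^2, clipped to guard against
    # small negative values caused by floating-point rounding.
    var = np.maximum(total_sq / n - mean ** 2, 0.0)
    return mean, np.sqrt(var)

def scale(X, mean, std):
    """Client side: apply the shared scaling; constant features are left centered."""
    std = np.where(std == 0, 1.0, std)
    return (np.asarray(X, dtype=float) - mean) / std
```

All clients end up with the same scaling parameters as a centralized fit on the pooled data, which is the consistency property described above. The sum-of-squares form can lose precision when feature means are large relative to their spread, so a production version might merge per-client means and variances instead.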
