An Empirical Study of the Impact of Federated Learning on Machine Learning Model Accuracy
Haotian Yang, Zhuoran Wang, Benson Chou, Sophie Xu, Hao Wang, Jingxian Wang, Qizhen Zhang
TL;DR
This study empirically evaluates how Federated Learning (FL) affects the accuracy of state-of-the-art models on text, image, audio, and video tasks, using a unified Flower-based framework. The authors systematically vary data distribution (non-IID and volume skew), client sampling, FL scale, local learning (batch size and epochs), and global federation strategy (FedAvg, FedAdam, FedYogi), and report task-dependent accuracy impacts along with practical deployment guidelines. Key findings: accuracy is highly sensitive to non-IID distributions; volume skew has limited impact; FedAvg remains a competitive baseline, with adaptive optimizers yielding gains only in selected cases; and some models need architecture adjustments (e.g., GroupNorm with FedYogi). The work contributes open-source, end-to-end FL experiments across data modalities, totaling about 6.2K GPU hours, to inform robust FL deployments in privacy-preserving settings.
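For readers unfamiliar with the FedAvg baseline named above, the following is a minimal sketch of its aggregation step: the server averages client parameters, weighting each client by its number of local training examples. It is an illustration only, not the authors' Flower-based implementation; the function name `fedavg` and the flattened-list representation of model parameters are assumptions made for brevity.

```python
from typing import List, Sequence


def fedavg(
    client_weights: Sequence[Sequence[float]],
    num_examples: Sequence[int],
) -> List[float]:
    """Average flattened client parameter vectors, weighted by example count.

    Hypothetical helper for illustration; real frameworks aggregate
    per-layer tensors rather than flat Python lists.
    """
    if not client_weights or len(client_weights) != len(num_examples):
        raise ValueError("need one example count per client")
    total = sum(num_examples)
    if total <= 0:
        raise ValueError("total number of examples must be positive")

    dim = len(client_weights[0])
    aggregated = [0.0] * dim
    for weights, n in zip(client_weights, num_examples):
        if len(weights) != dim:
            raise ValueError("all clients must share the same parameter shape")
        # Each client contributes in proportion to its share of the data.
        for i, w in enumerate(weights):
            aggregated[i] += w * n / total
    return aggregated


# Two clients; the second holds three times as much data as the first.
global_params = fedavg([[1.0, 2.0], [3.0, 4.0]], num_examples=[1, 3])
```

Because the weights are proportional to local dataset size, a client with more data pulls the global model further toward its own update. This is one reason volume skew and non-IID skew are studied as separate knobs.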
Abstract
Federated Learning (FL) enables distributed ML model training on private user data at global scale. Despite the potential FL has demonstrated in many domains, its impact on model accuracy is still not well understood. In this paper, we systematically investigate how this learning paradigm affects the accuracy of state-of-the-art ML models across a variety of ML tasks. We present an empirical study covering several data types (text, image, audio, and video) and FL configuration knobs (data distribution, FL scale, client sampling, and local and global computations). Our experiments are conducted in a unified FL framework to achieve high fidelity, with substantial human effort and resource investment. Based on the results, we perform a quantitative analysis of the impact of FL, highlight challenging scenarios where applying FL drastically degrades model accuracy, and identify cases where the impact is negligible. The detailed and extensive findings can benefit practical deployments and future development of FL.
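To make the "data distribution" knob concrete, here is a minimal sketch of one common way to simulate non-IID (label-skewed) clients: for each class, the samples are split across clients according to proportions drawn from a Dirichlet distribution. This is an assumption for illustration. The summary above does not state which partitioning scheme the study uses, and the function name `dirichlet_partition` and its parameters are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def dirichlet_partition(
    labels: Sequence[int],
    num_clients: int,
    alpha: float,
    seed: int = 0,
) -> Dict[int, List[int]]:
    """Assign sample indices to clients with Dirichlet(alpha) label skew.

    Smaller alpha gives more skewed (more non-IID) clients; larger alpha
    approaches an even split of every class.
    """
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    clients: Dict[int, List[int]] = {c: [] for c in range(num_clients)}
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)
        # A Dirichlet sample is a vector of Gamma draws normalised to sum to 1.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        if total == 0.0:  # guard against underflow for very small alpha
            draws, total = [1.0] * num_clients, float(num_clients)

        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += draws[c] / total
            # The last client takes the remainder so every index is assigned.
            if c == num_clients - 1:
                end = len(indices)
            else:
                end = round(cumulative * len(indices))
            end = max(start, min(end, len(indices)))
            clients[c].extend(indices[start:end])
            start = end
    return clients


# 200 samples over 4 classes, split across 5 simulated clients.
toy_labels = [i % 4 for i in range(200)]
parts = dirichlet_partition(toy_labels, num_clients=5, alpha=0.5, seed=1)
```

Under this scheme, volume skew arises as a side effect, since clients can end up with different numbers of samples. A study that isolates the two effects, as this one does, would need to control them separately.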
