Not All Federated Learning Algorithms Are Created Equal: A Performance Evaluation Study
Gustav A. Baumgart, Jaemin Shin, Ali Payani, Myungjin Lee, Ramana Rao Kompella
TL;DR
This work addresses the lack of holistic evaluation of federated learning (FL) algorithms by comparing six canonical methods (FedAvg, FedProx, FedYogi, FedAdam, SCAFFOLD, FedDyn) using the Flame framework across diverse hardware, model architectures, and non-IID data. It jointly analyzes time-to-target accuracy, computation and communication overheads, stability of accuracy across clients, and training instability, and finds that no single algorithm excels on all fronts. FedDyn often reaches higher accuracy within a fixed number of rounds, but it incurs longer wall-clock time and a greater risk of instability without gradient clipping, while server-side optimizers such as FedAdam and FedYogi remain stable at lower overhead. The study offers practical guidance for selecting FL algorithms in real deployments and highlights the importance of including system-level metrics in evaluation practices.
Abstract
Federated Learning (FL) has emerged as a practical approach to training a model from decentralized data. The proliferation of FL has led to the development of numerous FL algorithms and mechanisms. Most prior efforts have focused primarily on the accuracy of these approaches, and little is understood about other aspects such as computational overhead, performance, and training stability. To bridge this gap, we conduct an extensive performance evaluation of several canonical FL algorithms (FedAvg, FedProx, FedYogi, FedAdam, SCAFFOLD, and FedDyn) by leveraging an open-source federated learning framework called Flame. Our comprehensive measurement study reveals that no single algorithm works best across different performance metrics. A few key observations are: (1) While some state-of-the-art algorithms achieve higher accuracy than others, they incur either higher computation overheads (FedDyn) or higher communication overheads (SCAFFOLD). (2) Recent algorithms exhibit a smaller standard deviation in accuracy across clients than FedAvg, indicating that their performance is more stable. (3) However, algorithms such as FedDyn and SCAFFOLD are more prone to catastrophic failures without the support of additional techniques such as gradient clipping. We hope that our empirical study can help the community build best practices for evaluating FL algorithms.
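As an illustration of two of the metrics named above, the following minimal Python sketch computes the spread of accuracy across clients and the number of rounds needed to reach a target accuracy. The function names and all numbers are hypothetical and chosen for illustration only; they are not taken from the study or from the Flame framework, and the study's actual measurement procedure may differ.

```python
from statistics import mean, pstdev


def client_accuracy_spread(per_client_accuracy):
    """Return (mean, population standard deviation) of accuracy over clients.

    per_client_accuracy: dict mapping a client id to its test accuracy in [0, 1].
    """
    values = list(per_client_accuracy.values())
    return mean(values), pstdev(values)


def rounds_to_target(accuracy_by_round, target):
    """Return the 1-based index of the first round whose accuracy reaches
    the target, or None if the target is never reached."""
    for round_index, accuracy in enumerate(accuracy_by_round, start=1):
        if accuracy >= target:
            return round_index
    return None


# Made-up per-client accuracies for two hypothetical algorithms.
algo_a = {"client_0": 0.60, "client_1": 0.80, "client_2": 0.70}
algo_b = {"client_0": 0.69, "client_1": 0.71, "client_2": 0.70}

mean_a, spread_a = client_accuracy_spread(algo_a)
mean_b, spread_b = client_accuracy_spread(algo_b)

# Made-up global accuracy after each training round.
history = [0.30, 0.50, 0.62, 0.70]
first_round = rounds_to_target(history, target=0.60)
```

In this toy data both algorithms have nearly the same mean accuracy, but the second has a much smaller spread across clients, which is the sense in which an algorithm's performance is called more stable. A rounds-based measure like `rounds_to_target` ignores per-round cost; a wall-clock variant would accumulate the measured duration of each round instead of counting rounds.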
