Not All Federated Learning Algorithms Are Created Equal: A Performance Evaluation Study

Gustav A. Baumgart, Jaemin Shin, Ali Payani, Myungjin Lee, Ramana Rao Kompella

TL;DR

This work addresses the lack of holistic evaluation for federated learning (FL) algorithms by comparing six canonical methods (FedAvg, FedProx, FedYogi, FedAdam, SCAFFOLD, FedDyn) using the Flame framework across diverse hardware, model architectures, and non-IID data. It jointly analyzes time-to-target accuracy, computation and communication overheads, performance stability across clients, and training instability, and finds that no single algorithm excels on all fronts. FedDyn often reaches higher accuracy for a fixed number of rounds, but it takes longer in wall-clock time and is more prone to instability without gradient clipping. Server-side optimizers such as FedAdam and FedYogi show robust stability with lower overhead. The study offers practical guidance for selecting FL algorithms in real deployments and highlights the importance of including system-level metrics in evaluation practices.

Abstract

Federated Learning (FL) has emerged as a practical approach to training a model from decentralized data. The proliferation of FL has led to the development of numerous FL algorithms and mechanisms. Many prior efforts have focused primarily on the accuracy of those approaches, but there is little understanding of other aspects such as computational overhead, performance, and training stability. To bridge this gap, we conduct an extensive performance evaluation of several canonical FL algorithms (FedAvg, FedProx, FedYogi, FedAdam, SCAFFOLD, and FedDyn) by leveraging an open-source federated learning framework called Flame. Our comprehensive measurement study reveals that no single algorithm works best across different performance metrics. A few key observations are: (1) While some state-of-the-art algorithms achieve higher accuracy than others, they incur either higher computation overheads (FedDyn) or higher communication overheads (SCAFFOLD). (2) Recent algorithms show a smaller standard deviation in accuracy across clients than FedAvg, indicating that the advanced algorithms' performance is more stable. (3) However, algorithms such as FedDyn and SCAFFOLD are more prone to catastrophic failures without the support of additional techniques such as gradient clipping. We hope that our empirical study can help the community build best practices for evaluating FL algorithms.
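
To make the quantities in the observations above concrete, the following is a minimal Python sketch, not the paper's or Flame's implementation. It assumes model parameters and gradients are flat lists of floats, that "gradient clipping" means clipping by global L2 norm (the abstract does not say which variant is used), and that FedAvg weights clients by local dataset size. The function names `clip_by_global_norm`, `fedavg_aggregate`, and `accuracy_spread` are hypothetical and chosen only for illustration.

```python
import math
import statistics


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat gradient so its L2 norm does not exceed max_norm.

    Assumption: norm-based clipping; the source only says "gradient clipping".
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def fedavg_aggregate(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def accuracy_spread(local_accuracies):
    """Population standard deviation of per-client test accuracy.

    A smaller value means accuracy is more uniform across clients.
    """
    return statistics.pstdev(local_accuracies)
```

In this sketch, a client would call `clip_by_global_norm` on each local gradient before its optimizer step, the server would call `fedavg_aggregate` once per round, and `accuracy_spread` would be evaluated on the per-client test accuracies to compare stability across algorithms.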

Paper Structure (15 sections, 3 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Test accuracy for CIFAR-10 dataset with $\mathrm{Dir}(0.3)$ and 100 clients. We compare the accuracy across algorithms by choosing round or time on the x-axis.
  • Figure 2: The relative runtime overhead of algorithms compared to FedAvg's runtime with CNN (798K parameters), ResNet18 (11.7M), and ResNet34 (21.8M).
  • Figure 3: The relative runtime overhead of algorithms compared to FedAvg's runtime with LSTM-2 (134K parameters), LSTM-10 (780K), and LSTM-20 (1.59M).
  • Figure 4: Violin plots of local test accuracies for CIFAR-10 and Shakespeare for 5 different trials (a total of 500 values for each violin plot). These plots represent the distribution of local test accuracies across different runs.
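
The $\mathrm{Dir}(0.3)$ setting in Figure 1 refers to a Dirichlet-based non-IID split. The following Python sketch shows one common way such a split is built, assuming a label-skew scheme in which each class's samples are divided among clients according to proportions drawn from a symmetric Dirichlet distribution with concentration `alpha`. The caption does not describe the exact partitioning procedure, so this scheme, the function name `dirichlet_partition`, and the `seed` parameter are assumptions for illustration only.

```python
import random


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices among clients with Dirichlet label skew.

    For each class, client proportions are drawn from Dir(alpha, ..., alpha)
    by normalizing independent Gamma(alpha, 1) draws. Smaller alpha gives
    more skewed (more non-IID) client datasets.
    """
    rng = random.Random(seed)
    client_indices = [[] for _ in range(num_clients)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        start, cumulative = 0, 0.0
        for k in range(num_clients):
            cumulative += draws[k] / total
            if k == num_clients - 1:
                end = len(idx)  # last client takes whatever remains
            else:
                end = min(len(idx), int(round(cumulative * len(idx))))
            end = max(end, start)  # guard against rounding going backwards
            client_indices[k].extend(idx[start:end])
            start = end
    return client_indices


# Example with a toy 10-class label list, 100 clients, and alpha = 0.3.
toy_labels = [i % 10 for i in range(1000)]
parts = dirichlet_partition(toy_labels, num_clients=100, alpha=0.3)
```

With `alpha = 0.3`, most clients in this sketch end up holding samples from only a few classes, and some may hold very few samples overall. A real experiment would typically also enforce a minimum number of samples per client, which is omitted here.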