Decoding FL Defenses: Systemization, Pitfalls, and Remedies

Momin Ahmad Khan, Virat Shejwalkar, Yasra Chandio, Amir Houmansadr, Fatima Muhammad Anwar

TL;DR

This work introduces a three-dimensional systemization of FL defenses—processing of client updates, server knowledge, and defense phase—and uses it to systematically survey 50 defense papers. It identifies six prevalent pitfalls in defense evaluations and analyzes their impact through three representative defenses (TrMean, FLDetector, and FedRecover) across varied datasets, distributions, and FL settings. The study demonstrates that experimental setups can significantly distort robustness claims, particularly under intrinsically robust datasets, homogeneous data, slow-converging algorithms, and naive attacks, and it provides practical guidelines to combat these issues. The findings underscore the importance of realistic, heterogeneous, and scalable evaluation frameworks and advocate for personalized and per-class metrics to accurately assess FL defense robustness.

Abstract

While the community has designed various defenses to counter the threat of poisoning attacks in Federated Learning (FL), there are no guidelines for evaluating these defenses. These defenses are prone to subtle pitfalls in their experimental setups that lead to a false sense of security, rendering them unsuitable for practical deployment. In this paper, we systematically understand, identify, and provide a better approach to address these challenges. First, we design a comprehensive systemization of FL defenses along three dimensions: i) how client updates are processed, ii) what the server knows, and iii) at what stage the defense is applied. Next, we thoroughly survey 50 top-tier defense papers and identify the commonly used components in their evaluation setups. Based on this survey, we uncover six distinct pitfalls and study their prevalence. For example, we discover that around 30% of these works solely use the intrinsically robust MNIST dataset, and 40% employ simplistic attacks, which may inadvertently portray their defense as robust. Using three representative defenses as case studies, we perform a critical reevaluation to study the impact of the identified pitfalls and show how they lead to incorrect conclusions about robustness. We provide actionable recommendations to help researchers overcome each pitfall.
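The summary above advocates per-class metrics for assessing robustness. A minimal sketch of why this matters, assuming integer class labels and NumPy arrays: a poisoned model can keep a high overall accuracy while failing completely on one class. The function name `per_class_accuracy` and the toy labels are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Return an array whose entry c is the accuracy on samples of class c.

    Classes with no samples in y_true are reported as NaN.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = y_true == c          # samples whose true label is c
        if mask.any():
            acc[c] = (y_pred[mask] == c).mean()
    return acc

# Toy example (hypothetical data): 8 samples of class 0, 2 samples of class 1.
# The model predicts class 0 for everything.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10

overall = float((np.asarray(y_true) == np.asarray(y_pred)).mean())
per_class = per_class_accuracy(y_true, y_pred, num_classes=2)
# overall is 0.8, yet class 1 is misclassified every time (per_class is [1.0, 0.0]).
```

Reporting only the overall figure here would hide the fact that one class is entirely lost, which is the kind of distortion a per-class breakdown is meant to expose.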

Paper Structure

This paper contains 50 sections, 4 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: FL defense evaluation pipeline. We display common choices for each stage in the pipeline, e.g., FedSGD or FedAvg as the FL algorithm, and highlight the associated pitfall, i.e., using a slow-converging algorithm. We also indicate interdependencies between stages, e.g., large-scale and cross-device in FL type limit the number of malicious clients in the attack.
  • Figure 2: Systemizing FL Defenses: Categorization of defenses based on their defense phases. Processing operations for client updates are color-coded: filtering (blue), re-weighing (green), and modification (purple). Defenses belonging to the same category of processing may have different phases, e.g., FedRecover performs update modification at the server, while Ditto performs update modification at the client. Dependencies are also highlighted, such as FedRecover utilizing stored local and global models for estimation.
  • Figure 3: Frequency of choices for the six key components of a robustness evaluation setup: dataset, distribution of clients' data, FL algorithm, FL type, attacks, and evaluation. The paper's impact section (label sec:impact) discusses how these choices affect the robustness of FL poisoning defenses.
  • Figure 4: Comparative analysis of the TrMean AGR with FedSGD and FedAvg under the trim attack. TrMean is more susceptible to poisoning with FedSGD because its slow convergence gives adversaries more time to poison the model (see the paper's threat model section, label background:threat_model).
  • Figure 5: FEMNIST histogram from leaf.cmu.edu
  • ...and 10 more figures
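
TrMean, the aggregation rule compared in Figure 4, is commonly described as a coordinate-wise trimmed mean: for each model coordinate, the server discards the largest and smallest values across clients and averages the rest. The sketch below illustrates that general idea only; the function name `trimmed_mean`, the parameter `trim_k`, and the toy updates are assumptions, and the paper's exact configuration may differ.

```python
import numpy as np

def trimmed_mean(updates, trim_k):
    """Coordinate-wise trimmed mean over client updates.

    updates: array-like of shape (num_clients, num_params)
    trim_k:  number of values dropped from EACH end, per coordinate
             (an assumed parameterization for this sketch)
    """
    updates = np.asarray(updates, dtype=float)
    n = updates.shape[0]
    if trim_k < 0 or 2 * trim_k >= n:
        raise ValueError("trim_k must satisfy 0 <= 2 * trim_k < num_clients")
    # Sort each coordinate independently across clients.
    sorted_updates = np.sort(updates, axis=0)
    # Drop the trim_k smallest and trim_k largest values per coordinate.
    kept = sorted_updates[trim_k : n - trim_k]
    return kept.mean(axis=0)

# Toy example (hypothetical data): three benign clients and one outlier.
client_updates = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [100.0, -100.0],  # outlier update, e.g. from a malicious client
]
aggregate = trimmed_mean(client_updates, trim_k=1)
# Per coordinate, the extremes are removed before averaging:
# coordinate 0 keeps {2, 3}, coordinate 1 keeps {10, 20}.
```

Trimming bounds the influence of any single extreme value, but an attacker who crafts updates that stay inside the kept range can still bias the result, which is why the strength of the evaluated attack matters.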