Non-IID data in Federated Learning: A Survey with Taxonomy, Metrics, Methods, Frameworks and Future Directions

Daniel M. Jimenez G., David Solans, Mikko Heikkila, Andrea Vitaletti, Nicolas Kourtellis, Aris Anagnostopoulos, Ioannis Chatzigiannakis

TL;DR

This survey tackles non-IID data in Federated Learning by delivering a formal taxonomy of data heterogeneity, detailing partition protocols and metrics to quantify skew, and surveying practical non-IID solutions and standardized frameworks. It integrates a broad meta-analysis of literature quality, geography, and collaboration patterns, and introduces modality skew as a novel category. The work highlights the lack of consensus on non-IID definitions, the need for joint multi-skew evaluation, and the limited adoption of standardized FL frameworks, motivating future benchmarks and theoretical progress. Overall, the paper provides a comprehensive foundation for robust, reproducible evaluation and development of FL methods under diverse data heterogeneity, with clear directions toward multimodal data, improved metrics, and more efficient, adaptive algorithms.

Abstract

Recent advances in machine learning have highlighted Federated Learning (FL) as a promising approach that enables multiple distributed users (so-called clients) to collectively train ML models without sharing their private data. While this privacy-preserving method shows potential, it struggles when the data across clients is not independent and identically distributed (non-IID). Non-IID data remains an unsolved challenge that can result in poorer model performance and longer training times. Despite the significance of non-IID data in FL, there is a lack of consensus among researchers about its classification and quantification. This technical survey aims to fill that gap by providing a detailed taxonomy for non-IID data, partition protocols, and metrics to quantify data heterogeneity. Additionally, we describe popular solutions to address non-IID data and standardized frameworks employed in FL with heterogeneous data. Based on our state-of-the-art survey, we present key lessons learned and suggest promising future research directions.

Paper Structure

This paper contains 53 sections, 6 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Outline of our survey
  • Figure 2: Research interest in general FL vs. non-IIDness in FL
  • Figure 3: PRISMA flow for gathering relevant references
  • Figure 4: Rankings and quartiles distribution comparison
  • Figure 5: Geographic distribution of collected papers by first author affiliation country. Map values indicate each country's relative prevalence inside the collected database.
  • ...and 7 more figures