Non-IID data in Federated Learning: A Survey with Taxonomy, Metrics, Methods, Frameworks and Future Directions
Daniel M. Jimenez G., David Solans, Mikko Heikkila, Andrea Vitaletti, Nicolas Kourtellis, Aris Anagnostopoulos, Ioannis Chatzigiannakis
TL;DR
This survey tackles non-IID data in Federated Learning by delivering a formal taxonomy of data heterogeneity, detailing partition protocols and metrics to quantify skew, and surveying practical non-IID solutions and standardized frameworks. It integrates a broad meta-analysis of literature quality, geography, and collaboration patterns, and introduces modality skew as a novel category. The work highlights the lack of consensus on non-IID definitions, the need for joint multi-skew evaluation, and the limited adoption of standardized FL frameworks, motivating future benchmarks and theoretical progress. Overall, the paper provides a comprehensive foundation for robust, reproducible evaluation and development of FL methods under diverse data heterogeneity, with clear directions toward multimodal data, improved metrics, and more efficient, adaptive algorithms.
Abstract
Recent advances in machine learning have highlighted Federated Learning (FL) as a promising approach that enables multiple distributed users (so-called clients) to collectively train ML models without sharing their private data. While this privacy-preserving method shows potential, it struggles when data across clients is not independent and identically distributed (non-IID). Non-IID data remains an unsolved challenge that can result in poorer model performance and longer training times. Despite the significance of non-IID data in FL, there is a lack of consensus among researchers about its classification and quantification. This technical survey aims to fill that gap by providing a detailed taxonomy for non-IID data, partition protocols, and metrics to quantify data heterogeneity. Additionally, we describe popular solutions to address non-IID data and standardized frameworks employed in FL with heterogeneous data. Based on our state-of-the-art survey, we present key lessons learned and suggest promising future research directions.
