Benchmarking Algorithms for Federated Domain Generalization
Ruqi Bai, Saurabh Bagchi, David I. Inouye
TL;DR
This work defines Federated DG as a domain generalization problem under FL with domain-based client heterogeneity, and introduces a scalable benchmark methodology and open-source code to evaluate 14 methods across 7 diverse datasets. It proposes a novel Heterogeneous Partitioning framework that interpolates between homogeneous and completely heterogeneous client data via a parameter $\lambda$, supporting complete heterogeneity and true partitioning while balancing client data. The study systematically evaluates centralized DG adaptations, FL methods that handle client heterogeneity, and Federated DG-specific algorithms. It finds that FedAvg-ERM remains a strong baseline and that current Federated DG methods struggle with large client counts and realistic domain shifts, leaving significant gaps to close. The benchmark and findings provide a foundation for standardized evaluation and highlight practical directions for developing robust Federated DG methods with privacy-preserving data partitioning and scalable evaluation infrastructure.
Abstract
While prior domain generalization (DG) benchmarks consider train-test dataset heterogeneity, we evaluate Federated DG, which introduces challenges specific to federated learning (FL). Additionally, we explore domain-based heterogeneity in clients' local datasets, a realistic Federated DG scenario. Prior Federated DG evaluations are limited in terms of the number or heterogeneity of clients and dataset diversity. To address this gap, we propose a Federated DG benchmark methodology that enables control of the number and heterogeneity of clients and provides metrics for dataset difficulty. We then apply our methodology to evaluate 14 Federated DG methods, which include centralized DG methods adapted to the FL context, FL methods that handle client heterogeneity, and methods designed specifically for Federated DG. Our results suggest that despite some progress, there remain significant performance gaps in Federated DG, particularly when evaluating with a large number of clients, high client heterogeneity, or more realistic datasets. Please check our extendable benchmark code here: https://github.com/inouye-lab/FedDG_Benchmark.
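To make the idea of controlling client heterogeneity concrete, the following is a minimal illustrative sketch, not the partitioning algorithm from the paper or its repository. It assumes a hypothetical helper `partition_by_domain` and a mixing parameter `lam` with the convention (an assumption of this sketch) that `lam = 0` sends every sample to a client dedicated to its domain and `lam = 1` spreads samples uniformly at random across clients. Each sample is assigned to exactly one client, so the result is a true partition, but unlike the method described above this sketch makes no attempt to balance client sizes.

```python
import random


def partition_by_domain(domain_labels, num_clients, lam, seed=0):
    """Hypothetical sketch: split sample indices across clients.

    lam = 0.0 -> each sample goes to a "home" client of its domain
                 (domain-heterogeneous clients).
    lam = 1.0 -> each sample goes to a uniformly random client
                 (homogeneous clients).
    This convention and the assignment rule are assumptions of this
    sketch, not the benchmark's actual procedure.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    rng = random.Random(seed)

    # Map each distinct domain label to an integer id 0..D-1.
    domains = sorted(set(domain_labels))
    domain_id = {d: i for i, d in enumerate(domains)}
    num_domains = len(domains)

    # Home clients of domain d: clients c with c % D == d. If there are
    # fewer clients than domains, fall back to the single client d % C.
    home = {
        d: [c for c in range(num_clients) if c % num_domains == d]
        or [d % num_clients]
        for d in range(num_domains)
    }

    clients = [[] for _ in range(num_clients)]
    for idx, label in enumerate(domain_labels):
        if rng.random() < lam:
            target = rng.randrange(num_clients)        # homogeneous draw
        else:
            target = rng.choice(home[domain_id[label]])  # domain-specific draw
        clients[target].append(idx)
    return clients


# Usage: six samples from three domains, three clients, no mixing.
parts = partition_by_domain([0, 0, 1, 1, 2, 2], num_clients=3, lam=0.0)
```

With `lam=0.0` and as many clients as domains, each client in this sketch holds samples from a single domain. Intermediate values mix in samples from other domains in proportion to `lam`.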
