Benchmarking Algorithms for Federated Domain Generalization

Ruqi Bai; Saurabh Bagchi; David I. Inouye

TL;DR

This work defines Federated DG as a domain generalization problem under FL with domain-based client heterogeneity, and introduces a scalable benchmark methodology and open-source code to evaluate 14 methods across 7 diverse datasets. It proposes a novel Heterogeneous Partitioning framework that interpolates between homogeneous and completely heterogeneous client data via a parameter $\lambda$, ensuring complete heterogeneity and true partitioning while balancing client data. The study systematically evaluates centralized DG adaptations, FL-based heterogeneity methods, and Federated DG-specific algorithms. It finds that FedAvg-ERM remains a strong baseline and that current Federated DG methods struggle with large client counts and realistic domain shifts, leaving significant gaps to close. The benchmark and findings provide a foundation for standardized evaluation and highlight practical directions for developing robust Federated DG methods with privacy-preserving data partitioning and scalable evaluation infrastructure.
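
To make the role of $\lambda$ concrete, here is a minimal sketch of one way such an interpolation could work. It is not the authors' algorithm, which this summary does not spell out. It assumes $C \leq D$, that each domain has a single "home" client chosen round-robin, and that a fraction $\lambda$ of every domain is spread evenly over all clients while the rest stays with the home client. The function name `partition_by_lambda` and the home-client rule are hypothetical.

```python
import numpy as np

def partition_by_lambda(domain_ids, num_clients, lam, seed=0):
    """Assign every sample index to exactly one client (a true partition).

    lam = 1 spreads each domain evenly over all clients (homogeneous);
    lam = 0 keeps each domain entirely on its home client (fully heterogeneous).
    Illustrative sketch only; assumes num_clients <= number of domains.
    """
    rng = np.random.default_rng(seed)
    domain_ids = np.asarray(domain_ids)
    clients = [[] for _ in range(num_clients)]
    for k, d in enumerate(np.unique(domain_ids)):
        # Shuffle the indices of this domain's samples.
        idx = rng.permutation(np.flatnonzero(domain_ids == d))
        home = k % num_clients  # hypothetical round-robin home-client rule
        n_shared = int(round(lam * len(idx)))
        shared, kept = idx[:n_shared], idx[n_shared:]
        # The non-shared part stays with the home client.
        clients[home].extend(kept.tolist())
        # The shared part is split as evenly as possible across all clients.
        for c, chunk in enumerate(np.array_split(shared, num_clients)):
            clients[c].extend(chunk.tolist())
    return clients
```

Because every index is placed exactly once, the result is a partition for any $\lambda \in [0, 1]$; only the domain mix per client changes.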

Abstract

While prior domain generalization (DG) benchmarks consider train-test dataset heterogeneity, we evaluate Federated DG, which introduces federated learning (FL)-specific challenges. Additionally, we explore domain-based heterogeneity in clients' local datasets, a realistic Federated DG scenario. Prior Federated DG evaluations are limited in the number or heterogeneity of clients and in dataset diversity. To address this gap, we propose a Federated DG benchmark methodology that enables control of the number and heterogeneity of clients and provides metrics for dataset difficulty. We then apply our methodology to evaluate 14 Federated DG methods, which include centralized DG methods adapted to the FL context, FL methods that handle client heterogeneity, and methods designed specifically for Federated DG. Our results suggest that, despite some progress, significant performance gaps remain in Federated DG, particularly when evaluating with a large number of clients, high client heterogeneity, or more realistic datasets. Our extendable benchmark code is available at https://github.com/inouye-lab/FedDG_Benchmark.

Paper Structure (40 sections, 1 theorem, 16 equations, 10 figures, 13 tables, 2 algorithms)

Introduction
Background: Federated Domain Generalization Methods
Problem Background and Setup
Current Data partition methods
Heterogeneous Partitioning Method
Benchmark Methodology and Evaluations
Dataset Type and Dataset Difficulty Metrics
Benchmark Methods
Main Results
Conclusion and Discussion
Appendix
Data Partition in Federated DG
Heterogeneous Partitioning Algorithm and its guarantee
Other Partition Methods
Current Methods in solving DG
...and 25 more sections

Key Result

Proposition A.1

When $C \geq D$, $\mathcal{P}_0$ is optimal for the optimization problem eqn:opt_part. When $C < D$, eqn:opt_part is NP-hard, and $\mathcal{P}_0$ is a fast greedy approximation.
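
As an illustration of what a fast greedy approximation can look like when $C < D$, the sketch below assigns whole domains to clients with the classic largest-first rule: each domain goes to the client that currently holds the fewest samples. This assumes the objective in eqn:opt_part rewards balanced client sizes, which this summary does not state, so it should be read as a generic example rather than the paper's $\mathcal{P}_0$. The name `greedy_assign_domains` is hypothetical.

```python
import heapq

def greedy_assign_domains(domain_sizes, num_clients):
    """Greedily assign whole domains to clients, balancing sample counts.

    domain_sizes maps a domain name to its number of samples.
    Returns a dict mapping client index to a list of domain names.
    """
    # Min-heap of (current load, client index); the lightest client is on top.
    heap = [(0, c) for c in range(num_clients)]
    heapq.heapify(heap)
    assignment = {c: [] for c in range(num_clients)}
    # Visit domains from largest to smallest.
    for d, size in sorted(domain_sizes.items(), key=lambda kv: -kv[1]):
        load, c = heapq.heappop(heap)
        assignment[c].append(d)
        heapq.heappush(heap, (load + size, c))
    return assignment
```

For example, four domains with 5, 4, 3, and 2 samples split over two clients end up as two groups of 7 samples each.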

Figures (10)

Figure 1: Each color denotes data from a distinct domain, and $\lambda$ is the domain balancing parameter. (a) Train-test domain heterogeneity. (b) Domain partitioning when $C \leq D$ and when $C > D$. (c) Domain partitioning when $C \leq D$: homogeneous ($\lambda=1$), heterogeneous ($\lambda=0.1$), and extremely heterogeneous ($\lambda=0$).
Figure 2: Performance based on accuracy versus the number of clients across the PACS, CelebA, and Camelyon17 datasets.
Figure 3: PACS: Held-out DG test accuracy vs. varying communication rounds (resp. varying epochs).
Figure 4: Convergence curves on PACS; total clients and training domains $(C,D)=(100,2)$; domain heterogeneity increases from left to right: $\lambda=(1,0.1,0)$.
Figure 5: Accuracy versus communication rounds for IWildCam; total number of clients $C=243$; heterogeneity increases from the left to the right panel: $\lambda=(1,0.1,0)$.
...and 5 more figures

Theorems & Definitions (11)

Remark 4.1
Remark 4.2
Remark 4.3
Remark 4.4
Remark 4.5
Remark 4.6
Remark 4.7
Remark 4.8
Proposition A.1
proof
...and 1 more
