Distributionally Robust Clustered Federated Learning: A Case Study in Healthcare

Xenia Konti, Hans Riess, Manos Giannopoulos, Yi Shen, Michael J. Pencina, Nicoleta J. Economou-Zavlanos, Michael M. Zavlanos

TL;DR

This paper introduces Cross-silo Robust Clustered Federated Learning (CS-RCFL), an algorithm that uses the Wasserstein distance to construct ambiguity sets around each client's empirical distribution and then determines the optimal distributionally robust clustering of clients into coalitions.

Abstract

In this paper, we address the challenge of heterogeneous data distributions in cross-silo federated learning by introducing a novel algorithm, which we term Cross-silo Robust Clustered Federated Learning (CS-RCFL). Our approach leverages the Wasserstein distance to construct ambiguity sets around each client's empirical distribution that capture possible distribution shifts in the local data, enabling evaluation of worst-case model performance. We then propose a model-agnostic integer fractional program to determine the optimal distributionally robust clustering of clients into coalitions so that possible biases in the local models caused by statistically heterogeneous client datasets are avoided, and analyze our method for linear and logistic regression models. Finally, we discuss a federated learning protocol that ensures the privacy of client distributions, a critical consideration, for instance, when clients are healthcare institutions. We evaluate our algorithm on synthetic and real-world healthcare data.
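To make the idea of a Wasserstein ambiguity set and worst-case performance concrete, the following is a minimal sketch, not the paper's method. It assumes one-dimensional samples of equal size for the distance computation, and it uses the generic Lipschitz bound for the absolute loss (empirical loss plus radius times the Lipschitz constant, with an $\ell_2$ transport cost on the joint feature-label space). The function names and the radius `eps` are hypothetical and chosen only for illustration.

```python
import math


def wasserstein1_1d(a, b):
    """1-Wasserstein distance between two equal-size 1D empirical samples.

    For equal-size samples on the real line, the optimal coupling matches
    sorted values, so the distance is the mean absolute sorted difference.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def empirical_l1_loss(w, X, y):
    """Average absolute error of the linear model x -> w . x."""
    total = 0.0
    for xi, yi in zip(X, y):
        prediction = sum(wj * xj for wj, xj in zip(w, xi))
        total += abs(yi - prediction)
    return total / len(y)


def robust_l1_loss_bound(w, X, y, eps):
    """Upper bound on the worst-case absolute loss over a 1-Wasserstein ball.

    Assumption: the ball has radius eps and the transport cost is the l2 norm
    on (x, y). The loss |y - w . x| is Lipschitz in (x, y) with constant
    ||(-w, 1)||_2, so the worst case is at most
    empirical loss + eps * ||(-w, 1)||_2.
    """
    lipschitz = math.sqrt(1.0 + sum(wj * wj for wj in w))
    return empirical_l1_loss(w, X, y) + eps * lipschitz
```

A small distance between two clients' samples suggests that one client's data lies inside a modest ambiguity set around the other's, which is the intuition behind grouping such clients into the same coalition.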

Paper Structure

This paper contains 17 sections, 6 theorems, 22 equations, 3 figures.

Key Result

Lemma 1

Suppose $\Xi \subseteq \mathbb{R}^d$ and $f: \mathbb{R}^m \times \Xi \to \mathbb{R}$ is proper, convex, and lower semi-continuous. Suppose $\hat{\mathbb{Q}} \in \mathcal{M}(\Xi)$ is an estimated distribution on $\Xi$. Consider the worst-case expectation problem over a Wasserstein ambiguity set around $\hat{\mathbb{Q}}$. Then its dual formulation has zero duality gap.
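For orientation, worst-case expectation problems of this kind and their duals are commonly written as below in the Wasserstein distributionally robust optimization literature. This is a sketch of the standard form, not a reproduction of the lemma's own equation: the radius $\varepsilon$, the transport cost $d$, the 1-Wasserstein distance $W$, and the multiplier $\lambda$ are notation assumed here.

$$
\sup_{\mathbb{Q} \,:\, W(\mathbb{Q}, \hat{\mathbb{Q}}) \le \varepsilon} \mathbb{E}_{\xi \sim \mathbb{Q}}\big[f(x, \xi)\big]
\;=\;
\inf_{\lambda \ge 0} \Big\{ \lambda \varepsilon + \mathbb{E}_{\hat{\xi} \sim \hat{\mathbb{Q}}}\Big[\sup_{\xi \in \Xi} \big( f(x, \xi) - \lambda\, d(\xi, \hat{\xi}) \big)\Big] \Big\}
$$

Zero duality gap means the two sides coincide, so the infinite-dimensional maximization over distributions can be replaced by a one-dimensional minimization over $\lambda$.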

Figures (3)

  • Figure 1: Loss of the CS-RCFL method and the benchmarks for (a) linear regression models with absolute ($\ell_1$-) loss and (b) logistic regression models.
  • Figure 2: Distribution of patients' ethnicity at two different hospitals, an example that shows heterogeneity of patient populations across hospitals.
  • Figure 3: Loss of the CS-RCFL method and the benchmarks for the logistic regression model, evaluated on the eICU Collaborative Research Dataset.

Theorems & Definitions (11)

  • Lemma 1: Strong Duality (esfahani2017)
  • Lemma 2: Linear regression (chen2018)
  • Lemma 3: Logistic regression (shafieezadeh2015distributionally)
  • Lemma 4
  • proof
  • Theorem 1
  • proof
  • Proposition 1
  • proof
  • proof: Proof of Lemma \ref{lemma:upper-bounds}
  • ...and 1 more