A Universal Metric of Dataset Similarity for Cross-silo Federated Learning

Ahmed Elhussein, Gamze Gursoy

TL;DR

This work introduces a universal, privacy-preserving dataset similarity metric for cross-silo federated learning (FL) that is task-agnostic and requires no data exchange. By probing a global model after one FL round, it constructs per-class transport costs from final-layer activations using a hybrid optimal transport (OT) cost that combines feature and label terms, then aggregates them via Sinkhorn optimization to yield a bounded cost in [0,1]. The method uses Secure Multiparty Computation for feature-level similarity and differential privacy for class-distribution differences. It is supported by theoretical links to weight divergence in FL and by empirical validation on synthetic, benchmark, and medical imaging datasets. In practice, the metric guides algorithmic choices for personalization and collaboration, reducing the need for extensive multi-round experimentation and enabling principled site selection for federated studies.
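
The Sinkhorn aggregation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only a feature-level cost, omits the label-cost term, the per-class construction, and the secure and differentially private computation, and every name in it (`feature_cost`, `sinkhorn_plan`, `ot_score`, the `reg` value) is a hypothetical choice made for illustration. It assumes a cosine-based cost rescaled to [0,1], which keeps the resulting transport cost bounded in [0,1].

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def feature_cost(za, zb):
    """Pairwise cost in [0, 1]: (1 - cosine similarity) / 2 (an assumed choice)."""
    cos = np.clip(l2_normalize(za) @ l2_normalize(zb).T, -1.0, 1.0)
    return (1.0 - cos) / 2.0

def sinkhorn_plan(cost, a, b, reg=0.05, n_iters=500):
    """Entropy-regularized OT plan via standard Sinkhorn scaling iterations."""
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # match column marginals
        u = a / (K @ v)              # match row marginals
    return u[:, None] * K * v[None, :]

def ot_score(za, zb, reg=0.05):
    """Transport cost between two activation sets under uniform sample weights."""
    C = feature_cost(za, zb)
    a = np.full(len(za), 1.0 / len(za))
    b = np.full(len(zb), 1.0 / len(zb))
    P = sinkhorn_plan(C, a, b, reg=reg)
    return float(np.sum(P * C))      # in [0, 1] because C is and P sums to 1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shift = np.zeros(8)
    shift[0] = 5.0
    site_a = rng.normal(size=(50, 8)) + shift
    site_b = rng.normal(size=(60, 8)) + shift   # similar activations
    site_c = rng.normal(size=(60, 8)) - shift   # dissimilar activations
    print("similar sites:   ", ot_score(site_a, site_b))
    print("dissimilar sites:", ot_score(site_a, site_c))
```

In this toy setup, sites whose activations point in similar directions receive a lower score than sites whose activations point in opposite directions.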

Abstract

Federated Learning (FL) is increasingly used in domains such as healthcare to facilitate collaborative model training without data sharing. However, datasets located at different sites are often non-identically distributed, which degrades model performance in FL. Most existing methods for assessing these distribution shifts are limited by being dataset- or task-specific. Moreover, these metrics can only be calculated by exchanging data, a practice restricted in many FL scenarios. To address these challenges, we propose a novel metric for assessing dataset similarity. Our metric exhibits several desirable properties for FL: it is dataset-agnostic, is calculated in a privacy-preserving manner, and is computationally efficient, requiring no model training. In this paper, we first establish a theoretical connection between our metric and training dynamics in FL. Next, we extensively evaluate our metric on a range of datasets, including synthetic, benchmark, and medical imaging datasets. We demonstrate that our metric shows a robust and interpretable relationship with model performance and can be calculated in a privacy-preserving manner. We believe that this metric, as the first federated dataset similarity metric, can better facilitate successful collaborations between sites.

Paper Structure (29 sections, 2 theorems, 6 equations, 5 figures, 1 table, 3 algorithms)

Key Result

Proposition 1

Let $\mathbf{z}_i, \mathbf{z}_j \in \mathbb{R}^d$ be $\ell_2$-normalized final-layer activation vectors. If $\mathbf{z}_i, \mathbf{z}_j$ are modeled as independent random vectors whose directions are isotropically distributed on $S^{d-1}$, then for $t \in [0,1]$:

Figures (5)

  • Figure 1: Performance across varying costs. Results show percentage improvement over local training baseline, with FedAvg (black) as the primary comparison. FedProx (red), pFedME (green), and Ditto (purple) are shown with reduced opacity for reference.
  • Figure 2: Relationship between optimal regularization parameters and metric for personalized FL algorithms. Higher costs correlate with lower parameters, indicating stronger personalization is beneficial when clients have heterogeneous data.
  • Figure 3: Weight divergence during training for FedAvg models.
  • Figure 4: Scores calculated from the full dataset and from a subsampled dataset. The dashed line Y=X represents perfect agreement.
  • Figure 5: Performance across varying Wasserstein distances. Percentage improvement over local training baseline, with FedAvg (black) as the primary comparison. FedProx (red), pFedME (green), and Ditto (purple) are shown with reduced opacity.

Theorems & Definitions (2)

  • Proposition 1: Concentration of inner products for independent activations
  • Proposition 2: Gradient Dissimilarity for Same-Class Samples
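
The setting of Proposition 1 can be explored numerically. The sketch below does not reproduce the proposition's bound. It only samples pairs of independent directions uniformly from $S^{d-1}$ (by normalizing Gaussian vectors) and reports how the spread of their inner products changes with the dimension $d$. The function name `sphere_inner_products`, the dimensions, and the threshold 0.3 are illustrative assumptions. For uniform directions, the inner product has mean zero and variance $1/d$, so the empirical standard deviation should be close to $1/\sqrt{d}$.

```python
import numpy as np

def sphere_inner_products(d, n_pairs, rng):
    """Inner products of n_pairs independent pairs drawn uniformly from S^{d-1}."""
    x = rng.normal(size=(n_pairs, d))
    y = rng.normal(size=(n_pairs, d))
    # Normalizing isotropic Gaussian vectors gives uniform directions on the sphere.
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    y /= np.linalg.norm(y, axis=1, keepdims=True)
    return np.sum(x * y, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for d in (8, 64, 512):
        ip = sphere_inner_products(d, 20000, rng)
        tail = np.mean(np.abs(ip) >= 0.3)   # empirical tail frequency at t = 0.3
        print(f"d={d:4d}  std={ip.std():.4f}  1/sqrt(d)={d ** -0.5:.4f}  "
              f"P(|<zi,zj>| >= 0.3)~{tail:.4f}")
```

As $d$ grows, the inner products cluster more tightly around zero, which is the concentration behaviour the proposition's title refers to.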