Data Valuation and Detections in Federated Learning

Wenqian Li, Shuran Fu, Fengrui Zhang, Yan Pang

TL;DR

This work addresses data valuation and data-point detection in privacy-sensitive Federated Learning by introducing FedBary, a privacy-preserving framework based on the $p$-Wasserstein distance. FedBary casts data valuation as a federated Wasserstein barycenter problem, which enables contribution evaluation and dataset filtering without sharing raw data or fixing a learning algorithm. It supports analysis with or without a validation set via $\mathcal{W}_p$ distances and interpolating measures. Theoretical guarantees include a convergence result for the FedBary updates and a bound linking distributional distance to validation loss. Experiments on CIFAR-10 and synthetic setups demonstrate robust client ranking, effective noisy-data detection, and improved FL performance, while remaining robust to data duplication. Overall, FedBary offers a scalable, privacy-conscious basis for data marketplaces and incentive mechanisms in FL, balancing transparency, efficiency, and protection of sensitive data.

Abstract

Federated Learning (FL) enables collaborative model training while preserving the privacy of raw data. A challenge in this framework is the fair and efficient valuation of data, which is crucial for incentivizing clients to contribute high-quality data to the FL task. In FL scenarios with numerous data clients, often only a subset of clients and datasets is pertinent to a specific learning task, while the others have a negative or negligible impact on model training. This paper introduces a novel privacy-preserving method for evaluating client contributions and selecting relevant datasets without a pre-specified training algorithm in an FL task. Our proposed approach, FedBary, utilizes the Wasserstein distance within the federated context, offering a new solution for data valuation in the FL framework. This method ensures transparent data valuation and efficient computation of the Wasserstein barycenter, and it reduces the dependence on validation datasets. Through extensive empirical experiments and theoretical analyses, we demonstrate the potential of this data valuation method as a promising avenue for FL research.
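To make the idea of distance-based valuation concrete, here is a minimal sketch, not the FedBary procedure itself. It assumes one-dimensional data, equal sample sizes, and direct access to every sample (which FedBary is designed to avoid), and it uses the closed form for the empirical $p$-Wasserstein distance between sorted 1-D samples. The names `wasserstein_1d`, `rank_clients`, and the three synthetic clients are hypothetical and chosen only for illustration.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Empirical p-Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal coupling matches sorted values, so the
    distance reduces to an L^p norm between the sorted arrays.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

def rank_clients(client_samples, validation, p=2):
    """Order client names from closest to farthest from the validation sample."""
    distances = {
        name: wasserstein_1d(sample, validation, p)
        for name, sample in client_samples.items()
    }
    return sorted(distances, key=distances.get), distances

# Hypothetical synthetic clients (assumption: not taken from the paper).
rng = np.random.default_rng(0)
validation = rng.normal(0.0, 1.0, size=2000)
clients = {
    "clean": rng.normal(0.0, 1.0, size=2000),    # matches the validation distribution
    "shifted": rng.normal(1.0, 1.0, size=2000),  # mean shift
    "noisy": rng.normal(0.0, 3.0, size=2000),    # inflated variance
}
ranking, distances = rank_clients(clients, validation)
print(ranking)
print(distances)
```

A smaller distance is read here as a more relevant dataset. This centralized version only illustrates the scoring rule; the federated computation via interpolating measures is what the paper contributes.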

Paper Structure

This paper contains 28 sections, 2 theorems, 34 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Let $P_i$ be the distribution of the $i$-th client, where $i \in [1,N]$, let $Q^{(k)}$ be the Wasserstein barycenter at iteration $k$, and let $\gamma_i^{(k)}, \eta_{P_i}^{(k)}, \eta_{Q_i}^{(k)}$ be the interpolating measures computed in Algorithm 1 (FedBary). Then the sequence $(A^{(k)})$ defined from these quantities is non-increasing and converges to $\sum_{i=1}^N\mathcal{W}_p(P_i, Q)$.
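The role of an interpolating measure can be illustrated with a small sketch under strong assumptions: one-dimensional data, equal sample sizes, and both samples available in one place. In that setting a point $\gamma$ on the Wasserstein geodesic between $P$ and $Q$ can be built by interpolating sorted samples, and it satisfies $\mathcal{W}_p(P,\gamma) + \mathcal{W}_p(\gamma,Q) = \mathcal{W}_p(P,Q)$. The function names are hypothetical, and this direct construction is not the iterative federated update the theorem refers to.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Empirical p-Wasserstein distance between equal-size 1-D samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

def interpolate_1d(x, y, t=0.5):
    """Sample of the geodesic interpolant between two 1-D empirical measures.

    Sorted values are matched pairwise (the optimal 1-D coupling) and each
    pair is moved a fraction t of the way from x to y.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    return (1.0 - t) * x + t * y

# Hypothetical client and server samples (assumption: synthetic Gaussians).
rng = np.random.default_rng(1)
P = rng.normal(-2.0, 1.0, size=1000)   # held by the client
Q = rng.normal(3.0, 0.5, size=1000)    # held by the server
gamma = interpolate_1d(P, Q, t=0.5)

direct = wasserstein_1d(P, Q)
via_gamma = wasserstein_1d(P, gamma) + wasserstein_1d(gamma, Q)
print(direct, via_gamma)  # the two values agree up to floating-point error
```

The identity checked here is why the distance can be recovered from two partial distances that each involve $\gamma$ and only one of the raw datasets.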

Figures (6)

  • Figure 1: The client holds $P$ and the server holds $Q$; the interpolating measure $\gamma$ helps measure the distance $\mathcal{W}_p(P,Q)$. The local interpolating measures $\eta_P$ and $\eta_Q$ are shared for calculation and detection.
  • Figure 2: We compare 5 different approaches for shuffled-data detection: gamma corresponds to $\partial \mathcal{W}_p(P_i,\gamma_i)$; eta corresponds to $\partial \mathcal{W}_p(P_i,\eta_{Q_i})$; access data corresponds to $\partial \mathcal{W}_p(P_i,Q)$ (access to both datasets); the servergamma (gray) and servereta (purple) lines correspond to $\partial \mathcal{W}_p(\eta_{P_i},\eta_{Q_i})$ and $\partial \mathcal{W}_p(\eta_{P_i},Q)$, respectively. Noise Portion is the actual ratio of noisy data in a dataset; count (c-) is the number of detected negative calibrated gradient values; Detection ratio (r-) measures detection accuracy ($\#$detected noisy data $/$ $\#$noisy data).
  • Figure 3: Approximated and true Wasserstein barycenter of 3 Gaussian distributions at the 3rd epoch and the 10th epoch (overlapping).
  • Figure 4: Scatter plots: percentage of contribution for 5 clients under different valuation metrics (Cases 1 to 5). Histogram: distance between the local distribution and the Wasserstein barycenter when a validation set is not available.
  • Figure 5: Detection results on CIFAR-10. (a, b) Detection of samples with corrupted features and point-removal comparison: FedBary and Lava are superior. (c, d) Detection of mislabeled samples and point-removal comparison: FedBary performs similarly to Lava and conducts relatively accurate detections.
  • ...and 1 more figure
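As a rough companion to the barycenter of three Gaussians in Figure 3, the sketch below computes a 2-Wasserstein barycenter in the simplest setting where a closed form is available. It assumes one-dimensional Gaussians with equal weights and equal sample sizes, for which the barycenter is obtained by averaging sorted samples (quantile averaging). The parameters and the name `barycenter_1d` are hypothetical; the figure's actual dimensionality and parameters are not given here, and this is a centralized computation rather than the FedBary approximation.

```python
import numpy as np

def barycenter_1d(samples, weights=None):
    """2-Wasserstein barycenter of equal-size 1-D samples via quantile averaging."""
    sorted_samples = np.stack(
        [np.sort(np.asarray(s, dtype=float)) for s in samples]
    )
    if weights is None:
        weights = np.full(len(samples), 1.0 / len(samples))
    weights = np.asarray(weights, dtype=float)
    # Weighted average of matching order statistics across the inputs.
    return weights @ sorted_samples

# Hypothetical (mean, std) pairs for three Gaussian clients.
params = [(-3.0, 0.5), (0.0, 1.0), (4.0, 2.0)]
rng = np.random.default_rng(2)
samples = [rng.normal(m, s, size=5000) for m, s in params]

bary = barycenter_1d(samples)

# For 1-D Gaussians with equal weights, the W2 barycenter is Gaussian with
# mean equal to the average mean and std equal to the average std.
target_mean = np.mean([m for m, _ in params])
target_std = np.mean([s for _, s in params])
print(bary.mean(), target_mean)
print(bary.std(), target_std)
```

Comparing the empirical mean and standard deviation with the closed-form targets gives a quick sanity check of the kind Figure 3 shows visually for the approximated and true barycenters.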

Theorems & Definitions (7)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5