Efficient Data Valuation Approximation in Federated Learning: A Sampling-based Approach

Shuyue Wei, Yongxin Tong, Zimu Zhou, Tianran He, Yi Xu

TL;DR

This work tackles the prohibitive computational cost of Shapley-value-based data valuation in federated learning by introducing a unified stratified sampling framework and a novel algorithm called IPSS. The authors show that, under an FL linear regression assumption, the marginal-contribution scheme (MC-SV) yields lower variance than the complementary-contribution scheme (CC-SV) within this framework. They then exploit a "key combinations" phenomenon, in which only a limited number of dataset combinations strongly affect the final data value, to skip low-impact combinations and cut computation time with only minor approximation error. Theoretical analysis provides an error bound and a complexity result, while experiments on MNIST-derived synthetic data and real-world FL benchmarks (FEMNIST, Adult, Sent-140) show that IPSS outperforms representative baselines in both efficiency and accuracy, including in a setting with 100 clients. The method offers a practical way to value data contributions fairly across providers, supporting fair compensation and broader participation in cross-silo FL deployments.

Abstract

Federated learning (FL) is a paradigm for utilizing datasets across multiple data providers. In FL, cross-silo data providers often hesitate to share their high-quality datasets unless their data value can be fairly assessed. Shapley value (SV) has been advocated as the standard metric for data valuation in FL due to its desirable properties. However, the computational overhead of SV is prohibitive in practice, as it inherently requires training and evaluating an FL model across an exponential number of dataset combinations. Furthermore, existing solutions fail to achieve both high accuracy and high efficiency, so practical use of SV remains out of reach: they neglect to choose a suitable computation scheme for the approximation framework and overlook the properties of the utility function in FL. We first propose a unified stratified-sampling framework for two widely used schemes. Then, we analyze and choose the more promising scheme under the FL linear regression assumption. After that, we identify a phenomenon termed key combinations, where only a limited number of dataset combinations have a high impact on the final data value. Building on these insights, we propose a practical approximation algorithm, IPSS, which strategically selects high-impact dataset combinations rather than evaluating all possible combinations, thus substantially reducing time cost with minor approximation error. Furthermore, we conduct extensive evaluations on FL benchmark datasets to demonstrate that our proposed algorithm outperforms a series of representative baselines in terms of efficiency and effectiveness.
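To make the exponential cost concrete, the following minimal sketch computes exact Shapley values by enumerating every dataset combination. It is an illustration only, not the paper's method: `exact_shapley` and `toy_utility` are hypothetical names, and the closed-form utility stands in for what would really be training and evaluating an FL model on each combination.

```python
from itertools import combinations
from math import comb


def exact_shapley(n, utility):
    """Exact Shapley values for n clients via full subset enumeration.

    `utility` maps a frozenset of client indices to a score. In FL each call
    would mean training and evaluating a model, which is the expensive part.
    """
    cache = {}

    def u(subset):
        key = frozenset(subset)
        if key not in cache:
            cache[key] = utility(key)
        return cache[key]

    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Each stratum (subset size) carries total weight 1/n.
            weight = 1.0 / (n * comb(n - 1, size))
            for s in combinations(others, size):
                phi += weight * (u(set(s) | {i}) - u(s))
        values.append(phi)
    # len(cache) counts the distinct combinations that had to be evaluated.
    return values, len(cache)


def toy_utility(coalition):
    # Hypothetical stand-in for model accuracy: diminishing returns in data size.
    sizes = [100, 200, 300]
    total = sum(sizes[j] for j in coalition)
    return total / (total + 150.0)


values, evaluations = exact_shapley(3, toy_utility)
print(values, evaluations)
```

With three clients the sketch evaluates 2^3 = 8 combinations (seven non-empty ones plus the empty set), and the count doubles with every additional client, which is why sampling-based approximation becomes necessary.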

Paper Structure

This paper contains 30 sections, 4 theorems, 17 equations, 10 figures, 5 tables, 3 algorithms.

Key Result

Theorem 1

Alg. alg:mc4sv provides an unbiased estimate of the SV under either the MC-SV or the CC-SV scheme.
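As a rough illustration of why stratifying by combination size can give an unbiased estimate, the sketch below estimates one client's SV in the standard marginal-contribution form. It is a generic stratified sampler written under our own assumptions, not a reproduction of the paper's algorithm; `stratified_mc_sv`, `additive_utility`, and the sample budget are hypothetical.

```python
import random


def stratified_mc_sv(n, utility, i, samples_per_stratum, rng):
    """Estimate client i's Shapley value with one stratum per subset size.

    Within a stratum, subsets of the other clients are drawn uniformly at
    random, so each stratum mean is an unbiased estimate of the average
    marginal contribution at that size. Averaging the n stratum means with
    equal weight 1/n then matches the Shapley definition in expectation.
    """
    others = [j for j in range(n) if j != i]
    stratum_means = []
    for size in range(n):  # sizes 0 .. n-1
        total = 0.0
        for _ in range(samples_per_stratum):
            s = frozenset(rng.sample(others, size))
            total += utility(s | {i}) - utility(s)
        stratum_means.append(total / samples_per_stratum)
    return sum(stratum_means) / n


def additive_utility(coalition):
    # Hypothetical utility where each client adds a fixed amount, so the
    # exact Shapley value of client j is simply weights[j].
    weights = [0.1, 0.2, 0.3, 0.4]
    return sum(weights[j] for j in coalition)


estimate = stratified_mc_sv(4, additive_utility, 2, 5, random.Random(0))
print(estimate)
```

For an additive utility every sampled marginal contribution is the same, so the estimate coincides with the exact value up to floating-point rounding. For a general utility the estimate is only correct in expectation, and its variance depends on how the sample budget is split across strata.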

Figures (10)

  • Figure 1: (a) Three hospitals collaborate to train the FL model and aim to identify each hospital's data value. SV-based data valuation requires training and evaluating FL models across all possible hospital combinations (① to ⑦), i.e., it needs to run seven FL processes. As the number of clients increases, the number of required combinations grows exponentially. (b) Evaluations on the FL benchmark dataset FEMNIST with ten FL clients indicate that existing solutions fail to achieve high effectiveness and high efficiency simultaneously.
  • Figure 2: Example of the unified stratified sampling framework: both MC-SV and CC-SV rely on this hierarchical structure, which is naturally suited to stratified sampling. There are four FL clients, and the model utility is shown below each dataset combination. For instance, the utility of the FL model under dataset combination $\{\mathcal{D}_1, \mathcal{D}_3\}$ is $0.92$.
  • Figure 3: Observations when using the MC-SV-based scheme.
  • Figure 4: Results under combinations with size no more than $K$.
  • Figure 5: Example of Alg. alg:lightSampling with the same setup as Fig. fig:example_comb_shapley.
  • ...and 5 more figures

Theorems & Definitions (15)

  • Definition 1: Federated learning, FL
  • Definition 2: Data valuation for FL
  • Definition 3: MC-SV based computation scheme
  • Definition 4: CC-SV based computation scheme
  • Example 1
  • Example 2
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • ...and 5 more
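
For orientation, the two computation schemes named in Definitions 3 and 4 are commonly written as below. This is a sketch in standard notation, not a quotation of the paper's definitions. The symbols are our assumptions and may differ from the paper's: $N$ is the set of $n$ clients, $U(S)$ is the utility of the FL model trained on the datasets of the clients in $S$, and $\phi_i$ is the SV of client $i$.

```latex
% Marginal-contribution form (MC-SV): compare a combination with and without client i.
\phi_i = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}}
         \frac{U(S \cup \{i\}) - U(S)}{\binom{n-1}{|S|}}

% Complementary-contribution form (CC-SV): compare a combination containing
% client i with its complement.
\phi_i = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}}
         \frac{U(S \cup \{i\}) - U\bigl(N \setminus (S \cup \{i\})\bigr)}{\binom{n-1}{|S|}}
```

Both sums range over the same subsets grouped by size $|S|$, which is the layered structure that a stratified sampling framework can exploit for either scheme.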