Data Overvaluation Attack and Truthful Data Valuation in Federated Learning

Shuyuan Zheng, Sudong Cai, Chuan Xiao, Yang Cao, Jianbin Qin, Masatoshi Yoshikawa, Makoto Onizuka

TL;DR

The paper identifies a vulnerability in data valuation for federated learning: attackers can manipulate linear valuation metrics to inflate their data's perceived contribution. It formalizes a data overvaluation attack and proves that many common metrics are susceptible, while introducing Truth-Shapley, a Bayesian incentive-compatible, linear valuation that preserves fairness and efficiency. The authors provide theoretical characterization showing when truthfulness can be achieved and demonstrate, through extensive experiments across multiple FL settings and tasks, that Truth-Shapley is robust to manipulation and supports effective data selection and fair reward distribution. This work lays a foundation for secure, truthful data valuation in collaborative learning and suggests directions for defenses and extensions to other FL architectures.

Abstract

In collaborative machine learning (CML), data valuation, i.e., evaluating the contribution of each client's data to the machine learning model, has become a critical task for incentivizing and selecting positive data contributions. However, existing studies often assume that clients engage in data valuation truthfully, overlooking the practical motivation for clients to exaggerate their contributions. To expose this threat, this paper introduces the data overvaluation attack, which enables strategic clients to have their data significantly overvalued in federated learning, a widely adopted paradigm for decentralized CML. Furthermore, we propose a Bayesian truthful data valuation metric named Truth-Shapley. Truth-Shapley is the unique metric that satisfies a set of desirable axioms for data valuation while ensuring that, under certain conditions, clients' optimal strategy is to perform data valuation truthfully. Our experiments demonstrate the vulnerability of existing data valuation metrics to the proposed attack and validate the robustness and effectiveness of Truth-Shapley.
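To make the idea of overvaluation concrete, the following is a minimal toy sketch, not the paper's attack or its federated learning pipeline. It computes exact Shapley values for three hypothetical clients and shows that, because the Shapley value is linear in the coalition utilities, distorting a single reported utility shifts value from some clients to another. The client names, the utility numbers, and the assumption that a reported utility can be distorted at all are illustrative only; how a strategic client achieves such a distortion in practice is not modeled here.

```python
from itertools import combinations
from math import factorial


def shapley_values(clients, utility):
    """Exact Shapley value: weighted average of marginal contributions.

    `utility` maps every coalition (a frozenset of clients) to a number.
    """
    n = len(clients)
    values = {}
    for c in clients:
        others = [x for x in clients if x != c]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that exclude client c.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility[s | {c}] - utility[s])
        values[c] = total
    return values


# Hypothetical coalition utilities (e.g. test accuracy); not from the paper.
clients = ["A", "B", "C"]
truthful_utility = {
    frozenset(): 0.0,
    frozenset("A"): 0.5, frozenset("B"): 0.5, frozenset("C"): 0.5,
    frozenset("AB"): 0.7, frozenset("AC"): 0.7, frozenset("BC"): 0.7,
    frozenset("ABC"): 0.8,
}

# Assumption for illustration: the utility recorded for the coalition
# without A is lower than its true value. The grand coalition is unchanged.
distorted_utility = dict(truthful_utility)
distorted_utility[frozenset("BC")] = 0.5

truthful = shapley_values(clients, truthful_utility)
distorted = shapley_values(clients, distorted_utility)

for c in clients:
    print(c, round(truthful[c], 4), round(distorted[c], 4))
```

In this toy setting the three clients are symmetric under the truthful utilities, so they receive equal values. After the distortion, A's value rises while B's and C's fall, and the total stays equal to the utility of the grand coalition. The sketch only illustrates why a linear metric reacts to misreported utilities; it says nothing about which utilities a real client can influence.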

Paper Structure

This paper contains 26 sections, 7 theorems, 19 equations, 18 tables, and 3 algorithms.

Key Result

Lemma 3.1

If a data valuation metric $\phi$ satisfies LIN, then there exist functions $\beta_i: 2^{D_{\mathbb{N}}} \rightarrow \mathbb{R}$ and $\beta_{i,j}: 2^{D_{\mathbb{N}}} \rightarrow \mathbb{R}$ such that $\phi_{i,j}(D_{\mathbb{N}}, v) \equiv \sum_{\mathcal{S} \subseteq D_{\mathbb{N}}} \beta_{i,j}(\mathc

Theorems & Definitions (16)

  • Definition 2.1: Data Valuation
  • Lemma 3.1
  • Definition 3.2: Data Overvaluation Attack
  • Lemma 3.3
  • Definition 4.1: Bayesian Incentive Compatibility for Truthful Data Valuation
  • Theorem 4.3: Characterization 1
  • Theorem 4.4: Characterization 2
  • Theorem 4.5
  • Theorem 4.6
  • Theorem A.1: Uniqueness of SV (Shapley, 1953)
  • ...and 6 more