Measuring the Hidden Cost of Data Valuation through Collective Disclosure

Patrick Mesana, Gilles Caporossi, Sebastien Gambs

TL;DR

The paper addresses the hidden cost of data valuation by modeling a Data Union (DU) that coordinates collective disclosure under differential privacy to regulate value distribution. It introduces the Information Disclosure Game (IDG), a Stackelberg framework where the DU sets iterative, DP-enabled disclosure policies and the Data Consumer (DC) acquires data to meet a utility target, revealing an explicit acquisition cost. Through Yelp-based experiments using $k$-NN and SBERT embeddings, the authors show that valuation inherently entails exploration costs, with Shapley-based and bandit strategies each capable of achieving target utility under budget constraints. The findings highlight the need for minimum dividend guarantees to ensure inclusivity and suggest future work on extending valuation to differentiable models and gradient-based Shapley approximations to enhance scalability and privacy–utility trade-offs.

Abstract

Data valuation methods assign marginal utility to each data point that has contributed to the training of a machine learning model. If used directly as a payout mechanism, this creates a hidden cost of valuation, in which contributors with near-zero marginal value would receive nothing, even though their data had to be collected and assessed. To better formalize this cost, we introduce a conceptual and game-theoretic model, the Information Disclosure Game, between a Data Union (sometimes also called a data trust), a member-run agent representing contributors, and a Data Consumer (e.g., a platform). After first aggregating members' data, the DU releases information progressively by adding Laplacian noise under a differentially-private mechanism. Through simulations with strategies guided by data Shapley values and multi-armed bandit exploration, we demonstrate on a Yelp review helpfulness prediction task that data valuation inherently incurs an explicit acquisition cost and that the DU's collective disclosure policy changes how this cost is distributed across members.
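The disclosure step described above (Laplacian noise added by the DU, with repeated noisy releases averaged by the consumer) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sensitivity of 1.0, the value of epsilon, and the toy 8-dimensional vector are all assumptions chosen for illustration.

```python
import numpy as np

def release_noisy(x, epsilon, sensitivity=1.0, rng=None):
    """Return x plus Laplace noise with scale sensitivity / epsilon.

    The sensitivity value is an assumption; in practice it depends on how
    the data points (e.g. embeddings) are bounded or clipped.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return x + rng.laplace(loc=0.0, scale=scale, size=x.shape)

def denoise_by_average(noisy_copies):
    """Average several independent noisy releases of the same point."""
    return np.mean(np.stack(noisy_copies), axis=0)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=8)        # hypothetical private data point

# The consumer acquires the same point many times, each with fresh noise.
copies = [release_noisy(x, epsilon=0.5, rng=rng) for _ in range(200)]

err_single = np.abs(copies[0] - x).mean()
err_avg = np.abs(denoise_by_average(copies) - x).mean()
print(f"single release error: {err_single:.3f}")
print(f"averaged error:       {err_avg:.3f}")
```

Averaging reduces the noise roughly in proportion to the square root of the number of releases, which is why each additional acquisition has a cost: under sequential composition, every release of the same point also consumes additional privacy budget.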

Paper Structure

This paper contains 22 sections, 7 equations, 12 figures, 1 algorithm.

Figures (12)

  • Figure 1: Illustration of the Information Disclosure Game. A Data Union (DU) holds a private dataset and releases information to a Data Consumer (DC) by adding Laplacian noise to data points under $\epsilon$-differential privacy. The DC incrementally acquires these noisy versions of the data points, denoises them by averaging, and trains a model to reach a utility target. In this work, we focus on non-parametric $k$-Nearest Neighbors ($k$-NN) and SBERT embeddings (Reimers and Gurevych, 2019).
  • Figure 2: Validation accuracy for random data selection across varying parameters. Unlike Shapley-based methods, random selection fails to consistently reach the utility target (69.6%) within the budgeted iteration range.
  • Figure 3: Validation accuracy using estimated data Shapley selection across dataset percentages and iterations. The target accuracy is achieved with as little as 10% of the data.
  • Figure 4: 3D visualization of hyperparameter combinations. Green dots represent successful runs in which the DC reached the utility threshold while red crosses represent failures. Success becomes unlikely with budget-per-arm below 20 or when exploration is zero.
  • Figure 5: Gini coefficient of budget usage by hyperparameter setting. High values indicate that the budget is concentrated on a few data points, which is typical of "lucky" early selections.
  • ...and 7 more figures
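
The Gini coefficient mentioned in the Figure 5 caption can be computed from a vector of per-data-point budget usage. The sketch below uses the standard sorted-rank formula; the function name and the toy budget vectors are assumptions for illustration, and the paper may compute the coefficient differently.

```python
import numpy as np

def gini(budget_per_point):
    """Gini coefficient of a non-negative allocation (0 = perfectly even)."""
    x = np.sort(np.asarray(budget_per_point, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)            # 1-based ranks of the sorted values
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)

even = [5, 5, 5, 5]          # hypothetical: budget spread evenly
concentrated = [0, 0, 0, 20] # hypothetical: all budget on one data point

print(gini(even))
print(gini(concentrated))
```

With n data points, the coefficient ranges from 0 (even spread) to (n - 1) / n (all budget on a single point), so values near the upper end correspond to the concentrated spending the caption describes.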