Rethinking Data Value: Asymmetric Data Shapley for Structure-Aware Valuation in Data Markets and Machine Learning Pipelines

Xi Zheng, Yinghui Huang, Xiangyu Chang, Ruoxi Jia, Yong Tan

TL;DR

ADS extends Data Shapley by relaxing symmetry through application-specific ordered data groups $\sigma$, defining value as the average one-step marginal contribution over permutations that respect $\sigma$. It remains replication-aware and structure-aware, with a group efficiency property that allocates each group’s value to its incremental utility over preceding groups. The framework provides two scalable estimators: a Monte Carlo ADS with an $O\bigl(n\epsilon^{-2}\log(n/\delta)\bigr)$ time bound and a KNN-ADS surrogate that yields exact valuations for KNN predictors with $O(n\log n)$ per test point. Empirical results across synthetic augmentation, federated learning, and multi-stage LLM fine-tuning show ADS better distinguishes informative from redundant data and enables fairer, more robust data-market incentives for contributors and brokers alike.
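The replication-aware behavior and the group efficiency property can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes that "permutations that respect $\sigma$" means every member of an earlier group precedes every member of a later group, and that admissible permutations are averaged uniformly. The function name `exact_ads`, the toy utility `coverage`, and the record names are hypothetical.

```python
from itertools import permutations, product


def exact_ads(groups, utility):
    """Average one-step marginal contributions over all orderings that keep
    the groups in their given order (assumed reading of the constraint)."""
    values = {p: 0.0 for g in groups for p in g}
    count = 0
    # An admissible ordering is one permutation per group, concatenated in group order.
    for parts in product(*(permutations(g) for g in groups)):
        prefix = frozenset()
        prev = utility(prefix)
        for p in (q for part in parts for q in part):
            prefix = prefix | {p}
            cur = utility(prefix)
            values[p] += cur - prev  # one-step marginal contribution of p
            prev = cur
        count += 1
    return {p: total / count for p, total in values.items()}


def coverage(subset):
    # Hypothetical utility: number of distinct underlying records.
    # "a" and "a_copy" are treated as the same record.
    return float(len({name.split("_")[0] for name in subset}))


# A single group recovers the symmetric Data Shapley value.
ds = exact_ads([["a", "a_copy", "b"]], coverage)
# Originals first, copies second: the ordering used for ADS in this toy case.
ads = exact_ads([["a", "b"], ["a_copy"]], coverage)
print(ds)
print(ads)
```

In this toy case the symmetric value splits the credit for record `a` evenly between the original and its copy, while the ordered version credits the original in full and gives the copy its incremental utility over the preceding group, which is zero here.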

Abstract

Rigorous valuation of individual data sources is critical for fair compensation in data markets, informed data acquisition, and transparent development of ML/AI models. Classical Data Shapley (DS) provides an essential axiomatic framework for data valuation but is constrained by its symmetry axiom, which assumes that data sources are interchangeable. This assumption fails to capture the directional and temporal dependencies prevalent in modern ML/AI workflows, including the reliance of duplicated or augmented data on original sources and the order-specific contributions in sequential pipelines such as federated learning and multi-stage LLM fine-tuning. To address these limitations, we introduce Asymmetric Data Shapley (ADS), a structure-aware data valuation framework for modern ML/AI pipelines. ADS relaxes symmetry by averaging marginal contributions only over permutations consistent with an application-specific ordering of data groups. It preserves efficiency and linearity, maintains within-group symmetry and directional precedence across groups, and reduces to DS when the ordering collapses to a single group. We develop two complementary computational procedures for ADS: (i) a Monte Carlo estimator (MC-ADS) with finite-sample accuracy guarantees, and (ii) a k-nearest neighbor surrogate (KNN-ADS) that is exact and efficient for KNN predictors. Across representative settings with directional and temporal dependence, ADS consistently outperforms benchmark methods by distinguishing novel from redundant contributions and respecting the sequential nature of training. These results establish ADS as a principled and practical approach to equitable data valuation in data markets and complex ML/AI pipelines.
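The idea of averaging marginal contributions only over order-consistent permutations can be sketched as a simple sampling loop. This is a minimal illustration under stated assumptions, not the paper's MC-ADS algorithm: it assumes admissible permutations are sampled uniformly by shuffling within each group and concatenating the groups in order, and the sample count is arbitrary rather than derived from the paper's accuracy bound. The names `mc_ads`, `weights`, and `utility` are hypothetical.

```python
import math
import random


def mc_ads(groups, utility, num_samples=200, seed=0):
    """Monte Carlo estimate of order-constrained marginal contributions."""
    rng = random.Random(seed)
    values = {p: 0.0 for g in groups for p in g}
    for _ in range(num_samples):
        # Sample an admissible ordering: shuffle inside each group, keep group order.
        order = []
        for g in groups:
            block = list(g)
            rng.shuffle(block)
            order.extend(block)
        prefix = set()
        prev = utility(frozenset(prefix))
        for p in order:
            prefix.add(p)
            cur = utility(frozenset(prefix))
            values[p] += cur - prev  # one-step marginal contribution of p
            prev = cur
    return {p: v / num_samples for p, v in values.items()}


# Hypothetical toy utility with diminishing returns.
weights = {"c1": 4.0, "c2": 1.0, "s1": 3.0, "s2": 1.0}


def utility(subset):
    return math.sqrt(sum(weights[p] for p in subset))


contributors = ["c1", "c2"]  # earlier group (e.g. original data)
synthetic = ["s1", "s2"]     # later group (e.g. augmented data)
est = mc_ads([contributors, synthetic], utility)

group_two_total = est["s1"] + est["s2"]
incremental = utility(frozenset(weights)) - utility(frozenset(contributors))
print(est, group_two_total, incremental)
```

Because every sampled ordering places the whole first group before the second, the second group's estimated total telescopes to its incremental utility over the first group in each sample. Passing a single group instead reduces the loop to a plain permutation-sampling Data Shapley estimator.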

Paper Structure

This paper contains 20 sections, 9 theorems, 24 equations, 6 figures, 1 table, 2 algorithms.

Key Result

Lemma 3.1

Fix a loss function $L(\cdot)$ and a hypothesis class $\mathcal{H}$. For any finite collection of data sources $S$, define empirical risk minimization over data instances by […]. Then, for every $i\in[n]$, $\phi(z_{1,i};\,v,D^{\mathrm{dup}})=\phi(z_{2,i};\,v,D^{\mathrm{dup}})$, and […]

Figures (6)

  • Figure 1: Overview of data market involving multiple data contributors, a data broker, and model buyers.
  • Figure 2: Relative accuracy is test accuracy normalized by the baseline model before any intervention. (a) and (b) remove low-value and high-value augmented points, respectively. (c) and (d) add low-value and high-value augmented points to the original set, respectively. We compare ADS (MC-ADS and KNN-ADS where applicable) with symmetric baselines (MC-DS and KNN-DS), LOO, and random selection. Results are averaged over 10 seeds with 95% confidence intervals. ADS produces the strongest positive and negative shifts in the expected directions, indicating better discrimination between informative and redundant augmentations.
  • Figure 3: Fair allocation on MNIST under two broker strategies, shown as side-by-side stacked totals (DS vs. ADS) within each configuration. (a) Replication: As identical copies are added (Original $\rightarrow$ Copied once $\rightarrow$ Copied twice), DS progressively shifts value from contributors to the broker, while ADS keeps the contributors’ total essentially unchanged and assigns only small incremental gains to each copy. (b) Augmentation: After retaining only positively valued augmentations, DS still reallocates value away from contributors, while ADS preserves the contributors’ total and credits the broker exactly for the incremental gains contributed by informative synthetic data.
  • Figure 4: Federated learning with noisy contributors. (a) and (b): test accuracy when the top $3$ or top $4$ contributors per round are selected using different valuation methods. (c): cumulative detection rate of noisy contributors as we sweep through the worst ranked contributors. Results are averaged over $100$ runs with $95\%$ confidence intervals. MC-ADS consistently yields faster accuracy gains and superior noise detection relative to LOO and random selection.
  • Figure 5: Multi-stage fine-tuning of four LLMs: average estimation error under different data valuation strategies per iteration. Results are averaged over 15 runs with 90% confidence intervals. ADS consistently outperforms LOO and random selection in estimation error reduction.
  • ...and 1 more figure

Theorems & Definitions (22)

  • Example 1.1: Synthetic Data Valuation
  • Example 1.2: Participant Valuation in Federated Learning
  • Example 1.3: Dataset Procurement in Multi-Stage LLM Fine-Tuning
  • Definition 3.1: Single-step marginal contribution
  • Example 1 (continued)
  • Lemma 3.1: Symmetric valuation under redundant duplication
  • Example 2 (continued)
  • Definition 3.2: One-step state-conditioned marginal contribution
  • Remark 3.1: Aggregating value for contributors active across multiple rounds
  • Lemma 3.2: Violation of symmetry along the realized sequential trajectory
  • ...and 12 more