Is Data Valuation Learnable and Interpretable?

Ou Wu, Weiyao Zhu, Mengyang Li

TL;DR

This paper tackles the question of whether data valuation can be learned and made interpretable. It introduces a learning framework that predicts Shapley-like data values β_i from per-sample data-perception features u_i through two base models: an MLP (MLPbV) and a Sparse Regression Tree (SRTbV). By defining 11 universal sample characteristics and training a fixed-parameter valuation function f(·; Θ), the approach enables knowledge reuse across tasks and yields interpretable valuations via the SRT. Empirical results on CIFAR-10/100, IMDB, BBC, and ImageNet-scale data demonstrate that the learned valuations closely match Shapley-based baselines and that the SRT provides extractable, human-interpretable valuation rules. The work thus offers a scalable path to learnable, interpretable data valuation with practical relevance to data curation and pricing in data-centric pipelines.

Abstract

Measuring the value of individual samples is critical for many data-driven tasks, e.g., the training of a deep learning model. Recent literature has seen substantial effort in developing data valuation methods. The primary data valuation methodology is based on the Shapley value from game theory, and various methods have been proposed along this path. Even though Shapley value-based valuation has a solid theoretical basis, it is entirely an experiment-based approach, and no valuation model has been constructed so far. In addition, current data valuation methods ignore the interpretability of the output values, even though an interpretable data valuation method would be of great help for applications such as data pricing. This study aims to answer an important question: is data valuation learnable and interpretable? A learned valuation model has several desirable merits, such as a fixed number of parameters and knowledge reusability. An interpretable data valuation model can explain why a sample is valuable or not valuable. To this end, two new data value modeling frameworks are proposed, in which a multi-layer perceptron (MLP) and a new regression tree are utilized as specific base models for model training and interpretability, respectively. Extensive experiments are conducted on benchmark datasets. The experimental results provide a positive answer to the question. Our study opens up a new technical path for assessing data values. Large data valuation models can be built across many different data-driven tasks, which can promote the widespread application of data valuation.
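
As a rough illustration of what "learnable" means here, the sketch below fits a small MLP that maps per-sample features to precomputed data values. It is not the paper's MLPbV model: the two features, the synthetic target values, the hidden-layer size, and the function names (`init_mlp`, `forward`, `train`) are all assumptions made for the example. The point is only that the valuation function has a fixed number of parameters regardless of how many samples are valued.

```python
import math
import random

def init_mlp(n_in, n_hidden, rng):
    """Theta = (W1, b1, w2, b2): parameter count does not depend on dataset size."""
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    return W1, b1, w2, 0.0

def forward(theta, u):
    """Return the predicted value for feature vector u and the hidden activations."""
    W1, b1, w2, b2 = theta
    h = [math.tanh(sum(w * x for w, x in zip(row, u)) + b) for row, b in zip(W1, b1)]
    return sum(w * hj for w, hj in zip(w2, h)) + b2, h

def mse(theta, data):
    return sum((forward(theta, u)[0] - beta) ** 2 for u, beta in data) / len(data)

def train(theta, data, rng, lr=0.05, epochs=200):
    """Plain SGD on the squared error between predicted and target values."""
    W1, b1, w2, b2 = theta
    for _ in range(epochs):
        rng.shuffle(data)
        for u, beta in data:
            pred, h = forward((W1, b1, w2, b2), u)
            err = pred - beta  # derivative of 0.5 * (pred - beta)^2 w.r.t. pred
            for j in range(len(w2)):
                # Backpropagate through tanh using the pre-update output weight.
                grad_pre = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                for k in range(len(u)):
                    W1[j][k] -= lr * grad_pre * u[k]
                b1[j] -= lr * grad_pre
            b2 -= lr * err
    return W1, b1, w2, b2

# Synthetic stand-in data (assumption): two hypothetical features per sample and
# a made-up target value; real targets would come from a Shapley-style estimator.
rng = random.Random(0)
data = []
for _ in range(60):
    u = [rng.random(), rng.random()]
    data.append((u, 0.6 * u[0] - 0.3 * u[1] ** 2))

theta = init_mlp(n_in=2, n_hidden=4, rng=rng)
loss_before = mse(theta, data)
theta = train(theta, data, rng)
loss_after = mse(theta, data)
print(f"MSE before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Once trained, `forward(theta, u)[0]` can score a new sample from its features alone, which is the kind of knowledge reuse the abstract refers to.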

Paper Structure

This paper contains 18 sections, 2 theorems, 13 equations, 10 figures, 2 algorithms.

Key Result

Proposition 3.1

When $M \to +\infty$, $\beta^*_i$ solved from (mse-1) is proportional to the AME value of each sample, i.e., $\beta^*_i \propto AME_i$, where $AME_i$ is the average marginal effect of $x_i$.
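
To make the notion of average marginal effect concrete, here is a minimal Monte Carlo sketch. It does not reproduce the regression in (mse-1); it only estimates the quantity the proposition refers to, under the assumption that subsets are drawn by including every other sample independently with probability `p`. The utility function, the function name `estimate_ame`, and all numeric settings are hypothetical.

```python
import random

def estimate_ame(utility, n_points, p=0.5, n_draws=4000, seed=0):
    """Monte Carlo estimate of each point's average marginal effect."""
    rng = random.Random(seed)
    ame = []
    for i in range(n_points):
        total = 0.0
        for _ in range(n_draws):
            # Random subset of the other points, each kept with probability p.
            subset = frozenset(
                j for j in range(n_points) if j != i and rng.random() < p
            )
            # Marginal effect of adding point i to this subset.
            total += utility(subset | {i}) - utility(subset)
        ame.append(total / n_draws)
    return ame

def toy_utility(subset):
    """Hypothetical utility: points 0 and 1 are duplicates and count once,
    point 2 contributes independently, point 3 contributes nothing."""
    score = 0.0
    if 0 in subset or 1 in subset:
        score += 1.0
    if 2 in subset:
        score += 0.5
    return score

ame = estimate_ame(toy_utility, n_points=4)
print([round(v, 2) for v in ame])
```

In this toy setup the duplicated points each receive roughly half of their shared contribution, because the duplicate is already present in about half of the sampled subsets, while the useless point receives zero.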

Figures (10)

  • Figure 1: The MLP-based regression for data valuation.
  • Figure 2: The comparison results on Shapley value estimation on CIFAR10.
  • Figure 3: The comparison results on Shapley value estimation on CIFAR100.
  • Figure 4: The comparison results on Shapley value estimation on BBC.
  • Figure 5: The comparison results on Shapley value estimation on IMDB.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Proposition 3.1
  • Lemma 1