Proper Dataset Valuation by Pointwise Mutual Information

Shuran Zheng, Xuan Qi, Rui Ray Chen, Yongchan Kwon, James Zou

TL;DR

This work reframes dataset valuation through an information-theoretic lens, arguing that dataset quality should be measured by informativeness about the true model parameters under Blackwell ordering rather than solely by test-set performance. It introduces the PMI Dataset Score, a practically computable mutual information-based metric computed from Bayesian posteriors on embedded data to quantify the informativeness of curated datasets with respect to test data. Empirical results on MNIST and CIFAR show PMI better distinguishes informative from uninformative or strategically curated data than traditional test-score-based evaluation and remains robust to prior misspecification. By providing a principled, scalable approach to dataset valuation and data curation evaluation, the paper offers a path toward more reliable data-centric AI development and benchmarking.

Abstract

Data plays a central role in advancements in modern artificial intelligence, with high-quality data emerging as a key driver of model performance. This has prompted the development of principled and effective data curation methods in recent years. However, existing methods largely rely on heuristics, and whether they are truly effective remains unclear. For instance, standard evaluation methods that assess a trained model's performance on specific benchmarks may incentivize assigning high scores to data that merely resembles the test set. This issue exemplifies Goodhart's law: when a measure becomes a target, it ceases to be a good measure. To address this issue, we propose an information-theoretic framework for evaluating data curation methods. We define dataset quality in terms of its informativeness about the true model parameters, formalized using the Blackwell ordering of informativeness. Under this ordering, Blackwell's theorem ensures that more informative data yields optimal models with lower expected loss on the true underlying distribution. To measure informativeness, we show that the Blackwell order can be determined by the Shannon mutual information between the curated data and the test data. To estimate this mutual information, we introduce a novel method that trains Bayesian models on embedded datasets and computes mutual information from the posteriors of model parameters. Experiments on real-world data demonstrate that our mutual information-based evaluation assigns appropriately lower scores to data curation strategies that reduce dataset informativeness, while traditional test score-based evaluation methods may favor data curation strategies that overfit to the test set but compromise the training data's informativeness.
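To make the idea of computing pointwise mutual information (PMI) from posteriors concrete, here is a minimal sketch in a conjugate Gaussian model where every quantity has a closed form. It relies on a standard Bayes-rule identity, valid when the curated data $D$ and test data $T$ are conditionally independent given ${\bm{\theta}}$: $\log \frac{p(D,T)}{p(D)\,p(T)} = \log \frac{p({\bm{\theta}} \mid D)\, p({\bm{\theta}} \mid T)}{p({\bm{\theta}})\, p({\bm{\theta}} \mid D,T)}$ for any ${\bm{\theta}}$ with nonzero density. The scalar Gaussian model and all function names below are assumptions chosen for illustration; this is not the paper's implementation, which works with Bayesian models on embedded datasets.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log-density of a univariate Gaussian N(mean, var) at x."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def posterior(data, mu0, var0, noise_var):
    """Posterior mean and variance of theta for x_i ~ N(theta, noise_var),
    with prior theta ~ N(mu0, var0) (hypothetical toy model)."""
    data = np.asarray(data, dtype=float)
    precision = 1.0 / var0 + data.size / noise_var
    mean = (mu0 / var0 + data.sum() / noise_var) / precision
    return mean, 1.0 / precision

def pmi_from_posteriors(d, t, theta, mu0=0.0, var0=1.0, noise_var=1.0):
    """PMI(D, T) evaluated through posterior densities at an arbitrary theta."""
    m_d, v_d = posterior(d, mu0, var0, noise_var)
    m_t, v_t = posterior(t, mu0, var0, noise_var)
    m_dt, v_dt = posterior(np.concatenate([d, t]), mu0, var0, noise_var)
    return (gaussian_logpdf(theta, m_d, v_d)
            + gaussian_logpdf(theta, m_t, v_t)
            - gaussian_logpdf(theta, mu0, var0)
            - gaussian_logpdf(theta, m_dt, v_dt))

def log_marginal(data, mu0, var0, noise_var):
    """Log marginal likelihood: data ~ N(mu0 * 1, noise_var * I + var0 * 1 1^T)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    cov = noise_var * np.eye(n) + var0 * np.ones((n, n))
    resid = data - mu0
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def pmi_from_marginals(d, t, mu0=0.0, var0=1.0, noise_var=1.0):
    """Reference value: log p(D, T) - log p(D) - log p(T)."""
    dt = np.concatenate([d, t])
    return (log_marginal(dt, mu0, var0, noise_var)
            - log_marginal(d, mu0, var0, noise_var)
            - log_marginal(t, mu0, var0, noise_var))

# Synthetic data for the toy model (not from the paper's experiments).
rng = np.random.default_rng(0)
theta_true = rng.normal()
d = theta_true + rng.normal(size=20)   # stand-in for a curated dataset
t = theta_true + rng.normal(size=10)   # stand-in for a test set

print(pmi_from_posteriors(d, t, theta=0.3))
print(pmi_from_marginals(d, t))
```

The two printed values agree up to floating-point error, and the posterior-based value does not depend on which `theta` is plugged in. The posterior route needs only posterior densities, which is why it remains usable when marginal likelihoods are hard to evaluate.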

Paper Structure

This paper contains 50 sections, 15 theorems, 49 equations, 4 figures, 8 tables, 1 algorithm.

Key Result

Theorem 3.3

Suppose ${\bm{\theta}} \to D \to f(A,D)$ forms a Markov chain. Consider the decision problem of selecting a model $h$ from a model class $\mathcal{H}$ to minimize the expected loss using a dataset. Then, the minimum expected loss achievable using $D$ is at least as low as that achievable using $f(A,
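As a toy illustration of the ordering that Theorem 3.3 describes, the sketch below builds a Markov chain ${\bm{\theta}} \to D \to f(A,D)$ with binary variables and compares the minimum expected 0-1 loss attainable from each signal. The binary setup, the flip probabilities, and the names `channel_d` and `garble` are assumptions for illustration only; the theorem itself concerns general model classes and loss functions.

```python
import numpy as np

def bayes_risk(joint):
    """Minimum expected 0-1 loss when predicting theta from a signal.

    joint[i, j] = P(theta = i, signal = j). The optimal rule picks the most
    likely theta for each signal value, so the risk is one minus the sum of
    the column-wise maxima.
    """
    return 1.0 - joint.max(axis=0).sum()

prior = np.array([0.5, 0.5])          # P(theta)
channel_d = np.array([[0.9, 0.1],     # P(D = j | theta = i): 10% label noise
                      [0.1, 0.9]])
garble = np.array([[0.8, 0.2],        # P(f(A, D) = k | D = j): extra 20% flips
                   [0.2, 0.8]])       # driven by independent randomness A

joint_d = prior[:, None] * channel_d  # P(theta, D)
joint_f = joint_d @ garble            # P(theta, f(A, D)) via the Markov chain

risk_d = bayes_risk(joint_d)
risk_f = bayes_risk(joint_f)
print(risk_d, risk_f)
```

Here the original signal attains a risk of 0.10 and the post-processed one 0.26, so processing $D$ through $f$ cannot help in this example, consistent with the direction of the inequality in the theorem.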

Figures (4)

  • Figure 1: Graphical model for non-essential features.
  • Figure 1: Spearman's rank correlation ($\rho$) between estimated and ground-truth mutual information rankings for different estimation methods on Colored MNIST (Arjovsky et al., 2020) and CIFAR (Krizhevsky, 2009). Higher values indicate better alignment with the true MI ranking. PMI achieves the strongest correlation. We run 20 independent trials to compute the standard deviation, which quantifies the variability of estimation outcomes across repeated measurements. The Monte Carlo method fails on CIFAR, consistently outputting zero due to numerical instability caused by near-zero likelihoods. The dataset size is $100$.
  • Figure 2: Estimated rankings from different methods on CIFAR. PMI produces the most accurate estimates with the smallest variance. The $x$-axis denotes the ground-truth MI ranking indices, and the $y$-axis denotes the estimated rankings generated by each method. The lines represent the average estimated rankings over 20 trials, while the shaded regions indicate the range of their estimations. The dataset size is $100$. See the results for Colored MNIST in the appendix.
  • Figure 3: Estimated rankings from different methods on Colored MNIST. PMI produces the most accurate estimates with the smallest variance. The $x$-axis denotes the ground-truth MI ranking indices, and the $y$-axis denotes the estimated rankings generated by each method. The lines represent the average estimated rankings over 20 trials, while the shaded regions indicate the range of their estimations. The dataset size is $100$.
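The figure captions above evaluate estimators by Spearman's rank correlation between estimated and ground-truth MI rankings. A minimal sketch of that metric follows, assuming no tied values; the function name and the example numbers are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation for sequences without ties.

    Ranks are obtained with a double argsort, then the Pearson correlation
    of the two rank vectors is returned.
    """
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Hypothetical ground-truth MI values and noisy estimates for five datasets.
true_mi = np.array([0.1, 0.4, 0.9, 1.3, 2.0])
estimated_mi = np.array([0.2, 0.3, 1.1, 1.0, 1.8])

rho = spearman_rho(true_mi, estimated_mi)
print(rho)
```

In this example only the third and fourth datasets swap places, giving $\rho = 0.9$. With ties present, average ranks would be needed instead of the double argsort.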

Theorems & Definitions (34)

  • Definition 3.1: Data curation method
  • Definition 3.2: Blackwell order of informativeness (Blackwell, 1951)
  • Theorem 3.3: Informal (Blackwell, 1951)
  • Definition 3.4: Strategic data curation
  • Definition 3.5: Scoring function for data curation methods
  • Definition 3.6: Strategy-proof scoring functions
  • Proposition 4.1
  • Theorem 4.2: PMI dataset score
  • Corollary 4.3
  • Definition A.1
  • ...and 24 more