Unifying and Optimizing Data Values for Selection via Sequential-Decision-Making

Hongliang Chi, Qiong Wu, Zhengyi Zhou, Jonathan Light, Emily Dodwell, Yao Ma

TL;DR

This work reframes data selection as a sequential decision problem and unifies data valuation methods under an approximate dynamic programming (ADP) lens, showing that Data Shapley and related semi-values are myopic ADP solutions under linear surrogate rewards, with $U(S)$ as the utility. It derives theoretical guarantees: optimality under linear utilities and, for monotone submodular utilities with curvature $c$, the bound $U(G_k) \ge (1-c)^2 U(OPT_k)$ together with an analogous sequential bound. To scale to large datasets, the authors introduce bipartite surrogate models that represent utility as coverage on a graph, preserving monotonicity and submodularity so that greedy selection remains effective; they show that the optimal data values relate to $v^*(i) = n - t^*(i)$, where $t^*(i)$ is the optimal step for $i$. Empirically, the bipartite approach substantially outperforms existing valuation methods, especially in early-stage data selection, across eight OpenML datasets and multiple budgets. Overall, the framework provides principled insight into data valuation, offers efficient approximation methods, and demonstrates practical impact for data curation in diverse settings.
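The coverage idea can be made concrete with a minimal sketch. It assumes a hand-written bipartite graph (each training point maps to the set of target nodes it covers), unit node weights, 1-indexed selection steps, and ties broken by smallest index. None of these details come from the paper, and the function names are hypothetical. The sketch greedily builds a selection sequence and converts each point's step $t(i)$ into a value $v(i) = n - t(i)$.

```python
from typing import Dict, List, Set


def coverage_utility(selected: List[int], edges: Dict[int, Set[int]],
                     weights: Dict[int, float]) -> float:
    """Weighted coverage: total weight of target nodes hit by the selected points."""
    covered: Set[int] = set()
    for i in selected:
        covered |= edges[i]
    return sum(weights[j] for j in covered)


def greedy_sequence(edges: Dict[int, Set[int]],
                    weights: Dict[int, float]) -> List[int]:
    """Order all points by repeatedly taking the largest marginal coverage gain."""
    remaining = set(edges)
    covered: Set[int] = set()
    order: List[int] = []
    while remaining:
        # sorted() + max() means ties go to the smallest index (an assumption).
        best = max(sorted(remaining),
                   key=lambda i: sum(weights[j] for j in edges[i] - covered))
        order.append(best)
        covered |= edges[best]
        remaining.remove(best)
    return order


def values_from_sequence(order: List[int]) -> Dict[int, int]:
    """v(i) = n - t(i), with t(i) the 1-indexed step at which i was selected."""
    n = len(order)
    return {i: n - t for t, i in enumerate(order, start=1)}


# Toy graph (illustrative only): 4 training points, 5 target nodes.
edges = {0: {0, 1}, 1: {1}, 2: {2, 3, 4}, 3: {4}}
weights = {j: 1.0 for j in range(5)}

order = greedy_sequence(edges, weights)
values = values_from_sequence(order)
print(order)   # [2, 0, 1, 3]
print(values)  # {2: 3, 0: 2, 1: 1, 3: 0}
```

Points selected earlier receive larger values, so sorting by `values` in descending order reproduces the greedy sequence.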

Abstract

Data selection has emerged as a crucial downstream application of data valuation. While existing data valuation methods have shown promise in selection tasks, the theoretical foundations and full potential of using data values for selection remain largely unexplored. In this work, we first demonstrate that data values applied for selection can be naturally reformulated as a sequential-decision-making problem, where the optimal data value can be derived through dynamic programming. We show this framework unifies and reinterprets existing methods like Data Shapley through the lens of approximate dynamic programming, specifically as myopic reward function approximations to this sequential problem. Furthermore, we analyze how sequential data selection optimality is affected when the ground-truth utility function exhibits monotonic submodularity with curvature. To address the computational challenges in obtaining optimal data values, we propose an efficient approximation scheme using learned bipartite graphs as surrogate utility models, ensuring greedy selection is still optimal when the surrogate utility is correctly specified and learned. Extensive experiments demonstrate the effectiveness of our approach across diverse datasets.
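For intuition about the dynamic-programming view, here is a small sketch of exact DP over subsets. It assumes the objective is the cumulative utility of the selected set after every step; this is an assumption for illustration, since the abstract does not spell out the reward. The toy coverage utility and all names are hypothetical. The state space has $2^n$ subsets, so this only runs for tiny $n$.

```python
from functools import lru_cache
from itertools import permutations
from typing import Callable, Tuple


def optimal_sequence(n: int, utility: Callable[[int], float]) -> Tuple[float, Tuple[int, ...]]:
    """Exact DP over bitmask states; returns (best cumulative utility, selection order)."""
    full = (1 << n) - 1

    @lru_cache(maxsize=None)
    def best(mask: int) -> Tuple[float, Tuple[int, ...]]:
        if mask == full:
            return 0.0, ()
        candidates = []
        for i in range(n):
            if not mask & (1 << i):
                nxt = mask | (1 << i)
                future, tail = best(nxt)
                # Assumed reward: utility of the set right after adding point i.
                candidates.append((utility(nxt) + future, (i,) + tail))
        return max(candidates, key=lambda c: c[0])

    return best(0)


def cumulative_utility(order: Tuple[int, ...], utility: Callable[[int], float]) -> float:
    """Sum of utilities along a trajectory (the assumed objective)."""
    mask, total = 0, 0.0
    for i in order:
        mask |= 1 << i
        total += utility(mask)
    return total


# Toy utility (illustrative only): number of target nodes covered.
EDGES = [{0, 1}, {1}, {2, 3, 4}, {4}]


def toy_utility(mask: int) -> float:
    covered = set()
    for i, targets in enumerate(EDGES):
        if mask & (1 << i):
            covered |= targets
    return float(len(covered))


total, order = optimal_sequence(len(EDGES), toy_utility)
brute = max(cumulative_utility(p, toy_utility) for p in permutations(range(len(EDGES))))
print(total, order, brute)
```

The brute-force maximum over all permutations serves only as a sanity check on the recursion; the exponential cost of both is what motivates approximation.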

Paper Structure

This paper contains 51 sections, 14 theorems, 48 equations, 6 figures, 1 table, 2 algorithms.

Key Result

Theorem 4.1

For game-theoretic data valuation methods (Data Shapley, Beta Shapley and Data Banzhaf) that assign values $v: \mathcal{D} \rightarrow \mathbb{R}$, and induce a data sequence $\pi_v$ ranked by values, this sequence can be equivalently obtained as a solution to our sequential decision MDP under the f such that the trajectory generated by this ADP solution exactly matches the data sequence $\pi_v$ i
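One way to see the linear-surrogate reading is the sketch below. It assumes a surrogate utility $U(S) = \sum_{i \in S} v(i)$ built from precomputed values, with ties broken by smallest index. The values are made up for illustration and are not the output of any valuation method. Under this surrogate the marginal gain of a point does not depend on what has already been selected, so myopic greedy selection returns the same sequence as ranking by value.

```python
from typing import Dict, List


def rank_by_values(values: Dict[int, float]) -> List[int]:
    """Sequence induced by sorting points by value, highest first."""
    return sorted(values, key=lambda i: (-values[i], i))


def linear_utility(selected: List[int], values: Dict[int, float]) -> float:
    """Assumed linear surrogate: utility is the sum of individual values."""
    return sum(values[i] for i in selected)


def greedy_linear(values: Dict[int, float]) -> List[int]:
    """Myopic selection: at each step add the point with the largest marginal gain."""
    remaining = set(values)
    order: List[int] = []
    while remaining:
        base = linear_utility(order, values)
        best = max(sorted(remaining),
                   key=lambda i: linear_utility(order + [i], values) - base)
        order.append(best)
        remaining.remove(best)
    return order


# Made-up values (illustrative only).
toy_values = {0: 0.25, 1: 0.5, 2: -0.125, 3: 0.5}
ranked = rank_by_values(toy_values)
greedy = greedy_linear(toy_values)
print(ranked, greedy)  # [1, 3, 0, 2] [1, 3, 0, 2]
```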

Figures (6)

  • Figure 1: Demonstration of data selection performance curves with different data values. The x-axis shows the selection size (from 0 to dataset size $n$), and the y-axis shows test accuracy. Vertical dashed lines indicate different selection budgets. Superior data values (Data Values I) achieve both steeper initial curves and consistently higher performance across all budgets compared to Data Values II.
  • Figure 2: Framework for sequential data selection. Our framework consists of three components: (1) A sequential data decision problem that formulates data selection as step-by-step decision-making. (2) Core components of any solution to this sequential problem, including reward modeling and decision policies. (3) Selection performance curves showing outcomes from different combinations of reward modeling and decision policy, where exact DP achieves optimal performance (blue), linear ADP yields suboptimal results (gray), and the bipartite approximation attains a near-optimal result (yellow).
  • Figure 3: Performance comparison between optimal sequential selection (DynamicProgramming, solid gray) and existing data valuation methods across eight datasets. Results reveal performance gaps between existing methods and the optimal policy.
  • Figure 4: Evaluation of our proposed bipartite-based method against baselines on eight datasets. Our bipartite approach (solid yellow) demonstrates superior efficiency, requiring significantly fewer samples to achieve comparable accuracy.
  • Figure 5: Geometric visualization of increasing utility curvature through message passing: The feature space distribution of a three-class classification dataset evolves as the propagation proportion increases from $0.0$ to $1.0$. Points within each class progressively converge through feature averaging, illustrating the transition from low to high substitutability regimes predicted by our curvature analysis.
  • ...and 1 more figure

Theorems & Definitions (33)

  • Definition 2.1: Score-based Data Values
  • Definition 2.2: Game-theoretic Data Values
  • Definition 2.3: Data Values for Data Selection
  • Definition 3.2: Sequential Data Selection Problem
  • Theorem 4.1: Game-theoretic Data Values as ADP Solutions
  • Definition 4.2: Linear Utility Function
  • Theorem 4.3: Optimality Under Linear Utility
  • Definition 4.4: Monotonic Submodular Function
  • Definition 4.5: Curvature
  • Theorem 4.6
  • ...and 23 more