Unifying and Optimizing Data Values for Selection via Sequential-Decision-Making
Hongliang Chi, Qiong Wu, Zhengyi Zhou, Jonathan Light, Emily Dodwell, Yao Ma
TL;DR
This work reframes data selection as a sequential decision problem and unifies data valuation methods under an approximate dynamic programming (ADP) lens, showing that Data Shapley and related semi-values are myopic ADP solutions under linear surrogate rewards, with $U(S)$ as the utility. It derives optimality guarantees for linear utilities and, for monotone submodular utilities with curvature $c$, the bound $U(G_k) \ge (1-c)^2 U(OPT_k)$ together with an analogous sequential bound. To scale to large datasets, the authors introduce bipartite surrogate models that represent utility as coverage on a graph, preserving monotonicity and submodularity so that greedy selection remains effective. They also show that the optimal data values are related to $v^*(i) = n - t^*(i)$, where $t^*(i)$ is the optimal step for $i$. Empirically, the bipartite approach substantially outperforms existing valuation methods, especially in early-stage data selection, across eight OpenML datasets and multiple budgets. Overall, the framework offers principled insight into data valuation, provides efficient approximation methods, and demonstrates practical impact for data curation in diverse settings.
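To make the coverage idea concrete, the following is a minimal sketch of greedy selection on a toy bipartite graph, with step-based values of the form $v(i) = n - t(i)$. It is not the authors' implementation: the graph, the unit weights, the 1-indexed step convention, the tie-breaking rule, and all function names (`coverage_utility`, `greedy_order`, `step_values`) are assumptions chosen for illustration.

```python
from typing import Dict, Hashable, List, Set

def coverage_utility(selected: List[int],
                     edges: Dict[int, Set[Hashable]],
                     weights: Dict[Hashable, float]) -> float:
    """Weighted coverage: total weight of right-side nodes touched by the selection."""
    covered: Set[Hashable] = set()
    for i in selected:
        covered |= edges[i]
    return sum(weights[j] for j in covered)

def greedy_order(edges: Dict[int, Set[Hashable]],
                 weights: Dict[Hashable, float]) -> List[int]:
    """Repeatedly pick the point with the largest marginal coverage gain.

    Ties are broken by the smallest point index (an arbitrary choice).
    """
    remaining = sorted(edges)
    covered: Set[Hashable] = set()
    order: List[int] = []
    while remaining:
        best = max(remaining,
                   key=lambda i: sum(weights[j] for j in edges[i] - covered))
        order.append(best)
        covered |= edges[best]
        remaining.remove(best)
    return order

def step_values(order: List[int]) -> Dict[int, int]:
    """Assign v(i) = n - t(i), where t(i) is the 1-indexed step at which i was picked."""
    n = len(order)
    return {i: n - t for t, i in enumerate(order, start=1)}

# Hypothetical toy graph: data points 0..3 on the left, items "a".."e" on the right.
edges = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d", "e"}, 3: {"a"}}
weights = {j: 1.0 for j in "abcde"}

order = greedy_order(edges, weights)
values = step_values(order)
print(order)    # [2, 0, 1, 3]
print(values)   # {2: 3, 0: 2, 1: 1, 3: 0}
print(coverage_utility(order[:2], edges, weights))  # 5.0
```

Ranking points by these values reproduces the greedy order, so taking the top $k$ values for any budget $k$ yields the first $k$ greedy picks.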
Abstract
Data selection has emerged as a crucial downstream application of data valuation. While existing data valuation methods have shown promise in selection tasks, the theoretical foundations and full potential of using data values for selection remain largely unexplored. In this work, we first demonstrate that data values applied for selection can be naturally reformulated as a sequential-decision-making problem, where the optimal data value can be derived through dynamic programming. We show this framework unifies and reinterprets existing methods like Data Shapley through the lens of approximate dynamic programming, specifically as myopic reward function approximations to this sequential problem. Furthermore, we analyze how sequential data selection optimality is affected when the ground-truth utility function exhibits monotonic submodularity with curvature. To address the computational challenges in obtaining optimal data values, we propose an efficient approximation scheme using learned bipartite graphs as surrogate utility models, ensuring greedy selection is still optimal when the surrogate utility is correctly specified and learned. Extensive experiments demonstrate the effectiveness of our approach across diverse datasets.
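For comparison, the sketch below computes exact Data Shapley values on a small toy utility and ranks points by them. It uses the standard permutation definition of the Shapley value. The coverage utility, the toy graph, and the names `exact_shapley` and `utility` are assumptions for illustration only and are not taken from the paper. Exact enumeration is feasible only for a handful of points, since it visits all $n!$ orderings.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict, List, Sequence

def exact_shapley(points: Sequence[int],
                  utility: Callable[[List[int]], float]) -> Dict[int, float]:
    """Average marginal contribution of each point over all orderings."""
    totals = {i: 0.0 for i in points}
    for perm in permutations(points):
        prefix: List[int] = []
        prev = utility(prefix)
        for i in perm:
            prefix.append(i)
            cur = utility(prefix)
            totals[i] += cur - prev   # marginal gain of i given the points before it
            prev = cur
    n_perms = factorial(len(points))
    return {i: total / n_perms for i, total in totals.items()}

# Hypothetical toy utility: number of distinct items covered by the selection.
edges = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d", "e"}, 3: {"a"}}

def utility(selected: List[int]) -> float:
    covered = set()
    for i in selected:
        covered |= edges[i]
    return float(len(covered))

shapley = exact_shapley(list(edges), utility)

# Selection by value: sort once, then take the top-k for any budget k.
ranking = sorted(shapley, key=lambda i: -shapley[i])
print(shapley)
print(ranking)
```

The ranking here is computed once and then reused for every budget, so a point's score does not change based on what has already been selected. This is one way to see why such scores can be read as a myopic approximation to the sequential problem.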
