On the Impact of the Utility in Semivalue-based Data Valuation
Mélissa Tamine, Benjamin Heymann, Patrick Loiseau, Maxime Vono
TL;DR
The paper studies how robust semivalue-based data valuation is to the choice of utility, a practical concern when utilities reflect downstream goals that may vary. It introduces a geometric framework built on a dataset's spatial signature: data points are embedded into a low-dimensional space so that any linear combination of two base utilities corresponds to a projection. On top of this embedding, it defines a p-swap robustness metric $R_p$ that quantifies ranking stability as the utility direction varies on the unit circle $\mathcal{S}^1$. Empirical evaluations across multiple public datasets and semivalues (Shapley, $(4,1)$-Beta Shapley, Banzhaf) show that ranking stability as measured by $R_p$ aligns with Kendall rank-correlation results, with Banzhaf achieving higher robustness due to near-collinearity of the spatial embedding. The framework gives practitioners a computationally efficient tool to assess whether semivalue-based data valuation offers reliable guidance under varying utility definitions and to anticipate potential retraining costs when utilities shift.
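To make the geometric picture concrete, here is a minimal Python sketch; it is not the paper's implementation. It assumes each data point already has a two-dimensional signature whose coordinates are its semivalues under two base utilities. The arrays `collinear` and `scattered` are made-up illustrations, and `count_order_changes` is a hypothetical helper that serves only as a crude proxy for ranking stability. It is not the $R_p$ metric, whose exact definition is not given in this summary.

```python
import numpy as np


def rank_order(points, theta):
    """Return data-point indices sorted best-first for direction (cos theta, sin theta)."""
    direction = np.array([np.cos(theta), np.sin(theta)])
    # Projecting the signature onto the direction gives each point's value
    # under the combined utility cos(theta) * u1 + sin(theta) * u2.
    values = points @ direction
    return np.argsort(-values, kind="stable")


def count_order_changes(points, n_angles=360):
    """Count sampled directions on the circle whose ranking differs from the previous one."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    orders = [tuple(rank_order(points, t)) for t in thetas]
    # Index k - 1 wraps around at k = 0, so the full circle is covered.
    return sum(orders[k] != orders[k - 1] for k in range(n_angles))


# Hypothetical signature whose points lie exactly on one line: a + t * d.
t = np.array([0.0, 0.5, 1.0, 1.5])
collinear = np.array([0.1, 0.3]) + t[:, None] * np.array([1.0, 2.0])

# Hypothetical signature with points in general position.
scattered = np.array([[0.9, 0.1], [0.2, 0.7], [-0.5, 0.4], [0.3, -0.6]])

changes_collinear = count_order_changes(collinear)
changes_scattered = count_order_changes(scattered)
print("collinear:", changes_collinear, "scattered:", changes_scattered)
```

In this toy setup the collinear signature changes order only twice around the circle, because the ranking can only reverse, while the scattered signature changes order more often. This is the intuition behind near-collinearity leading to higher robustness.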
Abstract
Semivalue-based data valuation draws on cooperative game theory to assign each data point a value reflecting its contribution to a downstream task. However, these values depend on the practitioner's choice of utility, raising the question: how robust is semivalue-based data valuation to changes in the utility? This issue is critical when the utility is set as a trade-off between several criteria and when practitioners must select among multiple equally valid utilities. We address it by introducing the notion of a dataset's spatial signature: given a semivalue, we embed each data point into a lower-dimensional space where any utility becomes a linear functional, which gives the data valuation framework a simpler geometric picture. Building on this, we propose a practical methodology centered on an explicit robustness metric that informs practitioners whether and by how much their data valuation results will shift as the utility changes. We validate this approach across diverse datasets and semivalues, demonstrating strong agreement with rank-correlation analyses and offering analytical insight into how choosing a semivalue can amplify or diminish robustness.
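A utility can act as a linear functional on the embedding because semivalues are linear in the utility. The sketch below checks this numerically on a toy four-point game, using exact Shapley and Banzhaf values computed by brute-force enumeration. The utilities `u1` and `u2` and the scores `q` and `r` are hypothetical placeholders chosen for illustration; they do not come from the paper.

```python
import itertools
import math

import numpy as np


def semivalue(utility, n, weight):
    """Exact semivalue: weighted sum of marginal contributions over all subsets without i."""
    values = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in itertools.combinations(others, size):
                s = frozenset(subset)
                values[i] += weight(size, n) * (utility(s | {i}) - utility(s))
    return values


def shapley_weight(size, n):
    return math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)


def banzhaf_weight(size, n):
    return 1.0 / 2 ** (n - 1)


N = 4
q = [0.2, 0.5, 0.9, 0.4]  # hypothetical per-point scores for the first utility
r = [0.7, 0.1, 0.3, 0.6]  # hypothetical per-point scores for the second utility


def u1(s):
    return math.sqrt(sum(q[j] for j in s))


def u2(s):
    return max((r[j] for j in s), default=0.0)


theta = 0.7
c, sn = math.cos(theta), math.sin(theta)


def combined(s):
    return c * u1(s) + sn * u2(s)


results = {}
for name, weight in [("shapley", shapley_weight), ("banzhaf", banzhaf_weight)]:
    # Signature: one 2D point per data point, one coordinate per base utility.
    signature = np.column_stack([semivalue(u1, N, weight), semivalue(u2, N, weight)])
    projected = signature @ np.array([c, sn])  # value read off the embedding
    direct = semivalue(combined, N, weight)    # value recomputed from scratch
    results[name] = (projected, direct)
    print(name, np.allclose(projected, direct))
```

Under these assumptions, the two base-utility computations are done once, and the values for any other direction on the circle follow from a dot product instead of a fresh semivalue computation.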
