An Instrumental Value for Data Production and its Application to Data Pricing
Rui Ai, Boxiang Lyu, Zhaoran Wang, Zhuoran Yang, Haifeng Xu
TL;DR
This paper develops a principled, context-dependent valuation of data production that captures its instrumental value for decision-making. By embedding data production in a contextual Bayesian decision framework and linking value to generalized Bregman divergence, the authors show that in Bayesian linear regression the data value equals information gain, enabling closed-form characterizations. In a monopoly pricing setting, full data customization lets the seller extract exactly the first-best revenue, while limited customization yields revenue within $\log(\kappa)$ of first-best, where $\kappa$ is the condition number associated with the data matrix; as a corollary, first-best revenue is attainable in the multi-armed bandit special case. The work highlights the potential for significant price discrimination in data markets and provides algorithmic mechanisms (notably SVD-based) to implement near-optimal pricing, informing both market design and regulatory debates.
Abstract
How much value does a dataset or a data production process have to an agent who wishes to use the data to assist decision-making? This question is fundamental to understanding the value of data and, in turn, to pricing it. This paper develops an approach for capturing the instrumental value of data production processes, which takes two key factors into account: (a) the context of the agent's decision-making problem; (b) prior data or information the agent already possesses. We "micro-found" our valuation concepts by showing how they connect to classic notions of information design and signals in information economics. When instantiated in the domain of Bayesian linear regression, our value naturally corresponds to information gain. Building on this notion of data value, we then study a basic monopoly pricing setting in which a buyer seeks to purchase from a seller labeled data along a certain feature direction in order to improve a Bayesian regression model. We show that when the seller has the ability to fully customize any data request, she can extract the first-best revenue (i.e., full surplus) from any population of buyers, thereby achieving first-degree price discrimination. If the seller can only sell data that are derived from an existing data pool, her ability to customize is limited, and achieving first-best revenue becomes generally impossible. However, we design a mechanism that achieves seller revenue at most $\log(\kappa)$ less than the first-best revenue, where $\kappa$ is the condition number associated with the data matrix. A corollary of this result is that the seller can extract the first-best revenue in the special case of multi-armed bandits.
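To make the notion of information gain in Bayesian linear regression concrete, the following is a minimal sketch rather than the paper's own definition or mechanism. It assumes a Gaussian prior on the regression parameter with covariance `prior_cov`, a single labeled point with feature direction `x`, and Gaussian label noise with variance `noise_var`; under these assumptions the standard mutual information between the parameter and the label is $\frac{1}{2}\log(1 + x^\top \Sigma x / \sigma^2)$. All function and variable names are hypothetical, chosen for illustration only.

```python
import math

def quad_form(x, cov):
    """Return x^T cov x for a vector x and a square matrix cov (nested lists)."""
    n = len(x)
    return sum(x[i] * cov[i][j] * x[j] for i in range(n) for j in range(n))

def information_gain(x, prior_cov, noise_var):
    """Information gain (in nats) from one label y = x^T theta + eps.

    Assumes theta has a Gaussian prior with covariance prior_cov and
    eps ~ N(0, noise_var). This is the textbook Gaussian mutual
    information, used here as an illustrative stand-in for data value.
    """
    return 0.5 * math.log(1.0 + quad_form(x, prior_cov) / noise_var)

# Illustrative prior: the buyer is uncertain about the first coordinate
# (variance 4.0) and already well informed about the second (variance 0.25).
prior_cov = [[4.0, 0.0], [0.0, 0.25]]

gain_uncertain = information_gain([1.0, 0.0], prior_cov, noise_var=1.0)
gain_known = information_gain([0.0, 1.0], prior_cov, noise_var=1.0)

print(gain_uncertain, gain_known)
```

In this sketch, a label along the direction the buyer knows little about carries more information than one along a direction the buyer already knows well, which is the sense in which the value of data depends on what the agent already possesses.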
