The Economics of AI Training Data: A Research Agenda

Hamidah Oderinwale, Anna Kazlauskas

TL;DR

The paper addresses how to value data and integrate it into models of AI production, arguing that data is a distinct, nonrival input that requires formal economic treatment. It documents current data exchange and pricing, proposes a hierarchical framework of exchangeable data units, and shows how data could be represented explicitly in production functions, notably as $Y = f(K,L,D,A)$. It further outlines four open research questions (context-dependent valuation, governance with privacy, empirical estimation of data's production contribution, and market design for compositional goods) to guide the development of data economics. The work aims to catalyze cross-disciplinary collaboration among economics, computer science, law, and policy to shape data markets, governance, and investment in AI.
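
As one illustration of what an explicit representation could look like, the general form $Y = f(K,L,D,A)$ can be specialized to a Cobb-Douglas function. This is a sketch under assumptions, not the paper's specification: it assumes $K$, $L$, $D$, and $A$ denote capital, labor, data, and technology in the usual growth-accounting sense, and the exponents $\alpha$, $\beta$, $\gamma$ are hypothetical output elasticities introduced here for illustration.

$$
Y = A\,K^{\alpha} L^{\beta} D^{\gamma}, \qquad \frac{\partial \ln Y}{\partial \ln D} = \gamma
$$

Under this assumed form, $0 < \gamma < 1$ would correspond to diminishing returns to data, and estimating $\gamma$ is one way to phrase the open question of data's contribution to production.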

Abstract

Despite data's central role in AI production, it remains the least understood input. As AI labs exhaust public data and turn to proprietary sources, with deals reaching hundreds of millions of dollars, research across computer science, economics, law, and policy has fragmented. We establish data economics as a coherent field through three contributions. First, we characterize data's distinctive properties (nonrivalry, context dependence, and emergent rivalry through contamination) and trace historical precedents for market formation in commodities such as oil and grain. Second, we present systematic documentation of AI training data deals from 2020 to 2025, revealing persistent market fragmentation, five distinct pricing mechanisms (from per-unit licensing to commissioning), and the exclusion of original creators from compensation in most deals. Third, we propose a formal hierarchy of exchangeable data units (token, record, dataset, corpus, stream) and argue for data's explicit representation in production functions. Building on these foundations, we outline four open research problems foundational to data economics: measuring context-dependent value, balancing governance with privacy, estimating data's contribution to production, and designing mechanisms for heterogeneous, compositional goods.

Paper Structure

This paper contains 13 sections, 1 equation, 2 figures, 1 table.

Figures (2)

  • Figure 1: Three stylized models for data's contribution to AI production: diminishing returns (capital-like), sustained or increasing returns with quality, and inverted-U under contamination or overuse.
  • Figure 2: Machine learning pipeline showing data's distinct roles across pre-training, fine-tuning, and inference stages.
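
The three stylized models named in the Figure 1 caption can be made concrete with simple functional forms. The sketch below is illustrative only: the function names, the power-law and exponential-penalty forms, and all parameter values (`gamma`, `quality`, `decay`) are assumptions chosen for this example, not forms taken from the paper.

```python
import math


def diminishing_returns(d, gamma=0.5):
    """Capital-like model: output rises with data d, but each extra unit adds less.

    Assumed form: d ** gamma with 0 < gamma < 1.
    """
    return d ** gamma


def quality_scaled_returns(d, quality=1.0, gamma=1.2):
    """Sustained or increasing returns: quality scales the effective data.

    Assumed form: (quality * d) ** gamma with gamma >= 1.
    """
    return (quality * d) ** gamma


def inverted_u(d, gamma=0.5, decay=0.01):
    """Inverted-U model: gains from more data are eventually outweighed
    by a contamination or overuse penalty.

    Assumed form: d ** gamma * exp(-decay * d).
    """
    return d ** gamma * math.exp(-decay * d)


def peak_of_inverted_u(gamma=0.5, decay=0.01):
    """Data quantity that maximizes inverted_u.

    Setting the derivative of d ** gamma * exp(-decay * d) to zero
    gives d = gamma / decay.
    """
    return gamma / decay


if __name__ == "__main__":
    # Print each model's value at a few data quantities (arbitrary units).
    for d in (1, 10, 50, 100, 200):
        print(
            d,
            round(diminishing_returns(d), 3),
            round(quality_scaled_returns(d), 3),
            round(inverted_u(d), 3),
        )
```

With these assumed parameters, the first curve is concave, the second is convex, and the third rises until `gamma / decay` units of data and then declines, matching the qualitative shapes described in the caption.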