DAVED: Data Acquisition via Experimental Design for Data Markets
Charles Lu, Baihe Huang, Sai Praneeth Karimireddy, Praneeth Vepakomma, Michael Jordan, Ramesh Raskar
TL;DR
DAVED addresses data acquisition in data markets by directly optimizing data selection for unknown test queries without relying on ground-truth validation data. It reframes the problem as V-optimal experimental design, applies a kernelized linearization with a feature map φ (e.g., eNTK embeddings), and minimizes a budget-constrained proxy loss with federated Frank-Wolfe optimization to select datapoints. The paper also proves limitations of validation-based Data Shapley approaches (the inference-after-selection problem) and provides a near-optimal relaxation with provable guarantees. Empirically, DAVED achieves lower test error than baselines on synthetic data and multiple medical datasets, and it scales to large seller corpora in a fully federated setting, enabling practical decentralized data markets.
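To make the selection step in the TL;DR concrete, here is a minimal, centralized sketch of Frank-Wolfe applied to a V-optimal design objective. It is an illustration under stated assumptions, not the authors' implementation: the function names `v_objective` and `frank_wolfe_select`, the ridge term `reg`, the halving line search, and the final top-weight rounding to the budget are all hypothetical choices made for this example. The rows of `Z` and `X_test` are assumed to be already embedded by the feature map φ, and the federated communication pattern is not modeled.

```python
import numpy as np


def v_objective(w, Z, X_test, reg=1e-6):
    """Mean of x^T M(w)^{-1} x over test queries, with M(w) = sum_i w_i z_i z_i^T + reg*I."""
    d = Z.shape[1]
    M = Z.T @ (w[:, None] * Z) + reg * np.eye(d)
    A = np.linalg.solve(M, X_test.T)  # column j holds M^{-1} x_j
    return float(np.mean(np.sum(X_test.T * A, axis=0)))


def frank_wolfe_select(Z, X_test, budget, n_iters=100, reg=1e-6):
    """Frank-Wolfe over the probability simplex, then keep the `budget` largest weights."""
    n, d = Z.shape
    w = np.full(n, 1.0 / n)           # start from uniform weights over seller points
    steps = 2.0 ** -np.arange(1, 11)  # candidate step sizes 1/2, 1/4, ..., 1/1024
    current = v_objective(w, Z, X_test, reg)
    for _ in range(n_iters):
        M = Z.T @ (w[:, None] * Z) + reg * np.eye(d)
        A = np.linalg.solve(M, X_test.T)
        # d/dw_i of x^T M^{-1} x is -(z_i^T M^{-1} x)^2, averaged over test queries.
        grad = -np.mean((Z @ A) ** 2, axis=1)
        i = int(np.argmin(grad))      # simplex vertex with the steepest descent
        vertex = np.zeros(n)
        vertex[i] = 1.0
        candidates = [(1.0 - s) * w + s * vertex for s in steps]
        values = [v_objective(c, Z, X_test, reg) for c in candidates]
        best = int(np.argmin(values))
        if values[best] >= current:   # no candidate step improves the objective
            break
        w, current = candidates[best], values[best]
    selected = np.argsort(-w)[:budget]  # simplification: round by top weights
    return w, selected


# Usage on synthetic features (assumed already mapped through phi).
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 5))       # seller datapoints
X_test = rng.normal(size=(20, 5))   # buyer test queries
weights, chosen = frank_wolfe_select(Z, X_test, budget=10)
```

Each gradient coordinate depends only on one seller point and the shared matrix M(w), which is what makes a federated variant plausible: a seller could score its own points locally once M(w) and the test-query statistics are shared. The sketch above computes everything in one place for clarity.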
Abstract
The acquisition of training data is crucial for machine learning applications. Data markets can increase the supply of data, particularly in data-scarce domains such as healthcare, by incentivizing potential data providers to join the market. A major challenge for a data buyer in such a market is choosing the most valuable data points from a data seller. Unlike prior work in data valuation, which assumes centralized data access, we propose a federated approach to the data acquisition problem that is inspired by linear experimental design. Our proposed data acquisition method achieves lower prediction error without requiring labeled validation data and can be optimized in a fast and federated procedure. The key insight of our work is that a method that directly estimates the benefit of acquiring data for test set prediction is particularly compatible with a decentralized market setting.
