
DAVED: Data Acquisition via Experimental Design for Data Markets

Charles Lu, Baihe Huang, Sai Praneeth Karimireddy, Praneeth Vepakomma, Michael Jordan, Ramesh Raskar

TL;DR

DAVED addresses data acquisition in data markets by directly optimizing data selection for unknown test queries without relying on ground-truth validation data. It reframes the problem via V-optimal experimental design, applies a kernelized linearization with a feature map φ (e.g., eNTK embeddings), and solves a budget-constrained proxy loss using federated Frank-Wolfe optimization to select datapoints. The method also proves limitations of validation-based data Shapley approaches (inference-after-selection) and provides a near-optimal relaxation with provable guarantees. Empirically, DAVED achieves lower test error than baselines on synthetic data and multiple medical datasets, and scales to large seller corpora in a fully federated setting, enabling practical decentralized data markets.
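The selection step summarized above can be illustrated with a minimal, centralized sketch. It assumes plain linear features (no feature map φ), unit cost per datapoint, a small ridge term `reg` to keep the information matrix invertible, and a coarse grid line search for the step size. The function names `v_optimal_loss` and `frank_wolfe_design` are hypothetical, and the federated communication between platform and sellers is omitted, so this should be read as an illustration of Frank-Wolfe on a V-optimal design objective rather than as the authors' implementation.

```python
import numpy as np


def v_optimal_loss(w, X_sell, X_test, reg=1e-3):
    """Proxy loss: mean over test points x of x^T (sum_j w_j z_j z_j^T + reg*I)^{-1} x."""
    d = X_sell.shape[1]
    info = (X_sell * w[:, None]).T @ X_sell + reg * np.eye(d)
    solved = np.linalg.solve(info, X_test.T)  # shape (d, m)
    return float(np.mean(np.sum(X_test.T * solved, axis=0)))


def frank_wolfe_design(X_sell, X_test, n_iters=50, reg=1e-3):
    """Frank-Wolfe over the probability simplex of seller weights (assumed setup)."""
    n, d = X_sell.shape
    w = np.full(n, 1.0 / n)              # start from uniform weights
    steps = np.linspace(0.0, 1.0, 21)    # step 0 is included, so the loss never increases
    for _ in range(n_iters):
        info = (X_sell * w[:, None]).T @ X_sell + reg * np.eye(d)
        proj = X_sell @ np.linalg.solve(info, X_test.T)  # entry (j, i) = z_j^T A^{-1} x_i
        grad = -np.mean(proj ** 2, axis=1)               # d loss / d w_j
        j = int(np.argmin(grad))                         # linear minimization over the simplex
        vertex = np.zeros(n)
        vertex[j] = 1.0
        candidates = [(1.0 - s) * w + s * vertex for s in steps]
        losses = [v_optimal_loss(c, X_sell, X_test, reg) for c in candidates]
        w = candidates[int(np.argmin(losses))]
    return w


# Toy data (synthetic, for illustration only).
rng = np.random.default_rng(0)
X_sell = rng.normal(size=(60, 5))
X_test = rng.normal(size=(8, 5)) * np.array([3.0, 1.0, 1.0, 1.0, 0.1])

w = frank_wolfe_design(X_sell, X_test)
budget = 10
selected = np.argsort(-w)[:budget]  # keep the `budget` highest-weight seller points
```

Each coordinate of the gradient depends only on one seller datapoint and the current inverse information matrix, which is what makes this style of update amenable to a federated procedure.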

Abstract

The acquisition of training data is crucial for machine learning applications. Data markets can increase the supply of data, particularly in data-scarce domains such as healthcare, by incentivizing potential data providers to join the market. A major challenge for a data buyer in such a market is choosing the most valuable data points from a data seller. Unlike prior work in data valuation, which assumes centralized data access, we propose a federated approach to the data acquisition problem that is inspired by linear experimental design. Our proposed data acquisition method achieves lower prediction error without requiring labeled validation data and can be optimized in a fast and federated procedure. The key insight of our work is that a method that directly estimates the benefit of acquiring data for test set prediction is particularly compatible with a decentralized market setting.

Paper Structure (18 sections, 7 theorems, 39 equations, 16 figures, 2 tables, 1 algorithm)


Key Result

Theorem A.1

Let $w^*$ denote the solution of Problem eq:platform_obj and let $\hat{w}$ denote the solution of Problem eq:data_shapley_opt. Suppose the data $Z^{\mathrm{val}}, Z^{\mathrm{test}}$ are drawn i.i.d. from the distribution $\mathcal{D}_{X,Y}$, where $\mathcal{D}_{X}$ is supported on $B_R^d$ (zero-centered

Figures (16)

  • Figure 1: Overview of data acquisition process between buyer and seller. A buyer has a budget to acquire training data to get a prediction on their test query. The market platform optimizes the selection of seller data to be most useful for the buyer's query. The selected data is then used to train a regression model and make a prediction on the buyer's test data.
  • Figure 2: Current data valuation methods overfit when data is high-dimensional or validation sets are small. A total of 1,000 seller datapoints (each with cost 1) are available in the market. Each method selects training data for various budgets to train a regression model to predict the buyer's test data. The left plot shows that validation-based data valuation methods overfit when the data is high-dimensional, while the right plot shows that they also overfit when the validation set is small. In contrast, our proposed DAVED achieves lower test error across a range of budgets of over 100 buyers.
  • Figure 3: DAVED achieves lower test error than other methods on synthetic data. For three amounts of total seller data (1K, 5K, 100K), each method is optimized to select the most valuable training datapoints from the seller to predict the buyer's test data. Our data selection algorithm based on experimental design achieves lower test MSE on the buyer's data with fewer training points.
  • Figure 4: DAVED has low test error on real-world medical imaging and drug review data. We compare our method against other data valuation methods on two medical imaging datasets (Fitzpatrick17K and RSNA Bone Age) embedded through CLIP and the DrugLib review dataset embedded through GPT-2. After optimization, each data valuation method selects the top-$k$ most valuable datapoints from the seller to train a regression model to predict the buyer's test data. Our data selection algorithm based on experimental design achieves lower prediction mean squared error on the buyer's data with fewer training points.
  • Figure 5: DAVED has lower runtime than model-based data valuation methods. The left subplot shows runtimes for varying data dimensions when the number of datapoints is fixed at 1,000, while the right subplot shows runtimes for varying numbers of total seller datapoints when the dimension is fixed. In both cases, our method is orders of magnitude faster than Data Shapley, and the single-step variant of our method is faster than even optimized data valuation methods such as KNN Shapley and model-free methods such as LAVA.
  • ...and 11 more figures

Theorems & Definitions (9)

  • Theorem A.1
  • proof
  • Lemma A.2: Metric entropy, wainwright2019high
  • Lemma A.3: Fano's inequality, cover1999elements
  • Lemma A.4: Matrix-Chernoff bound, tropp2012user
  • Lemma A.5: Paley–Zygmund inequality, paley1932note
  • Lemma B.1: Lemma 5, jaggi2013revisiting
  • Theorem B.2
  • proof