Data Measurements for Decentralized Data Markets

Charles Lu, Mohammad Mohammadi Amiri, Ramesh Raskar

TL;DR

The paper addresses seller selection in decentralized data markets by introducing federated data measurements that signal data value without exposing raw data. A buyer generates a private query from its data embeddings and sends it to sellers; each seller returns measurements, which the buyer maps through a set of relevance and diversity metrics to rank sellers and anticipate downstream performance. Empirical results across 20 computer vision datasets and MedMNIST tasks show that relevance metrics excel at ranking sellers, while diversity metrics correlate with generalization performance. The measurements are robust to duplicated and noisy data, and seller misreporting can be mitigated by issuing multiple queries. The approach is privacy-preserving and scalable, can lower search costs, and can enable more equitable, broker-less data marketplaces; the paper also outlines practical limitations and directions for future enhancements.

Abstract

Decentralized data markets can provide more equitable forms of data acquisition for machine learning. However, to realize practical marketplaces, efficient techniques for seller selection need to be developed. We propose and benchmark federated data measurements to allow a data buyer to find sellers with relevant and diverse datasets. Diversity and relevance measures enable a buyer to make relative comparisons between sellers without requiring intermediate brokers or training task-dependent models.

Paper Structure (18 sections, 1 equation, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Steps of data measurements framework. A buyer embeds their data through some embedding model and sends a private query of matrix projections to each seller. Each seller responds with data measurements that allow the buyer to compare and transact with sellers that have the most relevant data.
  • Figure 2: Effect of duplicate data on data measurements. Each seller has 10,000 total datapoints, and a subset of datapoints is duplicated, keeping the total number of datapoints the same. Each colored dotted line represents an individual dataset, and the solid black line represents the average of all datasets. Error bars represent one standard deviation.
  • Figure 3: Effect of different types of noise corruptions on each data measurement. See the paper's corruption-images figure (label fig:corruption-images) for example images from the ImageNet-C dataset.
  • Figure 4: Varying the amount of data each IID seller has while fixing the buyer query to 100 datapoints.
  • Figure 5: Varying the amount of data in the buyer query while fixing each seller to 5,000 datapoints.
  • ...and 6 more figures
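
As a rough illustration of the steps described in Figure 1, the Python sketch below assumes that the buyer's private query consists of the top-k principal directions of its embeddings, and that each seller replies with the variance of its own embeddings along those directions. The function names (`buyer_query`, `seller_measurement`, `relevance`, `diversity`, `rank_sellers`) and both scoring formulas are hypothetical choices made for this example; the summary above does not specify the paper's exact definitions, so this should not be read as the authors' implementation.

```python
import numpy as np


def buyer_query(buyer_emb, k):
    """Buyer side: top-k principal directions of the buyer's embeddings.

    Returns (directions, variances): a (d, k) matrix that is sent to sellers
    and the buyer's own variance along each direction (kept by the buyer).
    """
    centered = buyer_emb - buyer_emb.mean(axis=0)
    cov = centered.T @ centered / (len(buyer_emb) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]         # indices of the k largest
    return eigvecs[:, top], eigvals[top]


def seller_measurement(seller_emb, directions):
    """Seller side: variance of the seller's data along each queried direction.

    Only these k numbers leave the seller; raw embeddings stay local.
    """
    centered = seller_emb - seller_emb.mean(axis=0)
    return (centered @ directions).var(axis=0, ddof=1)


def _geometric_mean(x, eps=1e-12):
    # Clamp at eps so that log(0) cannot occur.
    return float(np.exp(np.mean(np.log(np.maximum(x, eps)))))


def relevance(buyer_var, seller_var, eps=1e-12):
    """Assumed relevance score in [0, 1]: how well the variances agree."""
    hi = np.maximum(np.maximum(buyer_var, seller_var), eps)
    return _geometric_mean(np.minimum(buyer_var, seller_var) / hi)


def diversity(buyer_var, seller_var, eps=1e-12):
    """Assumed diversity score in [0, 1]: how much the variances differ."""
    hi = np.maximum(np.maximum(buyer_var, seller_var), eps)
    return _geometric_mean(np.abs(buyer_var - seller_var) / hi)


def rank_sellers(buyer_emb, sellers, k):
    """Return (name, relevance, diversity) tuples, most relevant seller first."""
    directions, buyer_var = buyer_query(buyer_emb, k)
    rows = []
    for name, emb in sellers.items():
        seller_var = seller_measurement(emb, directions)  # computed by the seller
        rows.append((name, relevance(buyer_var, seller_var),
                     diversity(buyer_var, seller_var)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Synthetic embeddings: one seller matches the buyer's distribution, one does not.
rng = np.random.default_rng(0)
scales = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.3, 0.1])
buyer = rng.normal(size=(500, 8)) * scales
sellers = {
    "similar": rng.normal(size=(2000, 8)) * scales,
    "dissimilar": rng.normal(size=(2000, 8)) * scales[::-1],
}
for name, rel, div in rank_sellers(buyer, sellers, k=4):
    print(f"{name}: relevance={rel:.3f} diversity={div:.3f}")
```

In this sketch the buyer never sees seller data, and each seller sees only the projection directions rather than the buyer's embeddings, which mirrors the query-and-response pattern of the framework. A real deployment would also need the protections against misreporting that the paper discusses, such as issuing multiple queries.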