Data Measurements for Decentralized Data Markets
Charles Lu, Mohammad Mohammadi Amiri, Ramesh Raskar
TL;DR
The paper tackles seller selection in decentralized data markets by introducing federated data measurements that signal data value without exposing raw data. A buyer generates a private query from its data embeddings and receives measurements from sellers; relevance and diversity metrics computed from these measurements are used to rank sellers and anticipate downstream performance. Empirical results across 20 computer vision datasets and MedMNIST tasks show that relevance metrics excel at ranking sellers, while diversity metrics correlate with generalization performance. The experiments also demonstrate robustness to duplicates and noise, and show that misreporting can be mitigated by issuing multiple queries. The work demonstrates a privacy-preserving, scalable approach to data valuation that can lower search costs and enable more equitable, broker-less data marketplaces, and it outlines practical limitations and directions for future work.
Abstract
Decentralized data markets can provide more equitable forms of data acquisition for machine learning. However, to realize practical marketplaces, efficient techniques for seller selection need to be developed. We propose and benchmark federated data measurements to allow a data buyer to find sellers with relevant and diverse datasets. Diversity and relevance measures enable a buyer to make relative comparisons between sellers without requiring intermediate brokers or training task-dependent models.
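
To make the query-and-measurement exchange concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the buyer's query consists of the top-k principal directions of its embeddings, that each seller's measurement is the variance of its own embeddings along those directions, that relevance is an overlap ratio between the two variance profiles, and that diversity is the geometric mean of the seller's variances. All function names (`make_query`, `measure`, `relevance`, `diversity`, `rank_sellers`) and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_query(buyer_emb, k):
    """Buyer side (assumed): top-k principal directions and their variances."""
    centered = buyer_emb - buyer_emb.mean(axis=0)
    cov = centered.T @ centered / (len(buyer_emb) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    return eigvecs[:, order], eigvals[order]

def measure(seller_emb, directions):
    """Seller side (assumed): variance of local embeddings along each direction.
    Only these k numbers leave the seller, never the raw data."""
    centered = seller_emb - seller_emb.mean(axis=0)
    return (centered @ directions).var(axis=0, ddof=1)

def relevance(buyer_var, seller_var):
    """Illustrative relevance: overlap of the variance profiles, in [0, 1]."""
    lo = np.minimum(buyer_var, seller_var).sum()
    hi = np.maximum(buyer_var, seller_var).sum()
    return float(lo / hi)

def diversity(seller_var, eps=1e-12):
    """Illustrative diversity: geometric mean of the measured variances."""
    return float(np.exp(np.mean(np.log(seller_var + eps))))

def rank_sellers(buyer_emb, sellers, k=3):
    """Rank sellers by relevance; each entry is (name, (relevance, diversity))."""
    directions, buyer_var = make_query(buyer_emb, k)
    scores = {}
    for name, emb in sellers.items():
        m = measure(emb, directions)
        scores[name] = (relevance(buyer_var, m), diversity(m))
    return sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)

# Synthetic demo: one seller matches the buyer's distribution, one does not.
rng = np.random.default_rng(0)
std = np.sqrt(np.array([3.0, 2.0, 1.0, 0.1, 0.1, 0.1, 0.1, 0.1]))
buyer = rng.normal(size=(2000, 8)) * std
sellers = {
    "similar": rng.normal(size=(2000, 8)) * std,
    "off": rng.normal(size=(2000, 8)) * std[::-1],
}
ranking = rank_sellers(buyer, sellers, k=3)
for name, (rel, div) in ranking:
    print(f"{name}: relevance={rel:.3f}, diversity={div:.3f}")
```

In this sketch the buyer compares sellers using only k scalars per seller, which mirrors the idea of making relative comparisons without a broker or a task-specific model. The specific metrics are stand-ins; other relevance or diversity scores could be substituted in `relevance` and `diversity` without changing the exchange.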
