Distinctiveness Maximization in Datasets Assemblage
Tingting Wang, Shixun Huang, Zhifeng Bao, J. Shane Culpepper, Volkan Dedeoglu, Reza Arablouei
TL;DR
This paper addresses budgeted dataset acquisition to maximize distinctiveness, defined as the number of distinct tuples in the union of query results across a user's query set, relative to a base dataset. It proves NP-hardness and presents two greedy approaches: Exact-Greedy, which has a (1-1/e)/2 approximation ratio, and ML-Greedy, which uses ML-based estimation to predict the marginal gain of candidate datasets. The ML approach relies on a five-component pipeline that builds per-dataset data summaries, query-aware embeddings, and a learned distinctiveness estimator, yielding large efficiency gains while maintaining competitive effectiveness. Extensive experiments on five real-world data pools show that ML-Greedy outperforms baselines in effectiveness, efficiency, and scalability, and a case study demonstrates improved downstream ML task performance when using the assembled datasets.
Abstract
In this paper, given a user's query set and budget, we aim to use the limited budget to help users assemble a set of datasets that can enrich a base dataset by introducing the maximum number of distinct tuples (i.e., maximizing distinctiveness). We prove this problem to be NP-hard. A greedy algorithm using exact distinctiveness computation attains an approximation ratio of (1-1/e)/2, but it is neither efficient nor scalable because it repeatedly computes the exact distinctiveness marginal gain of each candidate dataset under consideration. Doing so requires scanning every tuple in the candidate datasets, which is unaffordable in practice. To overcome this limitation, we propose an efficient machine learning (ML)-based method for estimating the distinctiveness marginal gain of any candidate dataset, which eliminates the need to test each tuple individually. Estimating the distinctiveness marginal gain of a dataset involves estimating the number of distinct tuples in the tuple sets returned by each query in a query set across multiple datasets. This can be viewed as cardinality estimation for a query set over a set of datasets, and the proposed method is the first to tackle this cardinality estimation problem. This is a significant advancement over prior methods, which were limited to single-query cardinality estimation on a single dataset and struggled to identify overlaps among the tuple sets returned by each query in a query set across multiple datasets. Extensive experiments using five real-world data pools demonstrate that our algorithm, which utilizes ML-based distinctiveness estimation, outperforms all relevant baselines in effectiveness, efficiency, and scalability. A case study on two downstream ML tasks also highlights its potential to find datasets with more useful tuples to enhance the performance of ML tasks.
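To make the cost of exact marginal-gain computation concrete, the following is a minimal Python sketch of a budgeted greedy selection over in-memory datasets. It is an illustration under stated assumptions, not the paper's algorithm: the names `query_results` and `greedy_assemble`, the per-dataset `prices`, the representation of queries as row predicates, and the gain-per-price selection rule are all hypothetical, and the sketch makes no claim to the (1-1/e)/2 guarantee.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Set, Tuple

Row = Tuple[Hashable, ...]
Query = Callable[[Row], bool]  # assumption: a query is a predicate over one row


def query_results(rows: Sequence[Row], queries: Sequence[Query]) -> Set[Row]:
    """Union of the tuples returned by every query over one dataset.

    This scans every tuple, which is the cost the abstract calls unaffordable.
    """
    return {row for row in rows if any(q(row) for q in queries)}


def greedy_assemble(
    base: Sequence[Row],
    candidates: Dict[str, Sequence[Row]],
    prices: Dict[str, float],  # assumption: each dataset has a positive price
    queries: Sequence[Query],
    budget: float,
) -> Tuple[List[str], int]:
    """Pick datasets by exact marginal gain per unit price until the budget runs out.

    Returns the chosen dataset names and the number of distinct tuples
    they add to the query results of the base dataset.
    """
    covered = query_results(base, queries)
    base_size = len(covered)
    results = {name: query_results(rows, queries) for name, rows in candidates.items()}

    chosen: List[str] = []
    remaining = budget
    while True:
        best_name, best_ratio = None, 0.0
        for name, res in results.items():
            if name in chosen or prices[name] > remaining:
                continue
            gain = len(res - covered)  # exact distinctiveness marginal gain
            ratio = gain / prices[name]
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:  # nothing affordable adds a new tuple
            break
        chosen.append(best_name)
        covered |= results[best_name]
        remaining -= prices[best_name]
    return chosen, len(covered) - base_size


# Toy example with hypothetical data.
base = [(1, "a"), (2, "b")]
candidates = {
    "A": [(2, "b"), (3, "c"), (4, "d")],
    "B": [(3, "c")],
    "C": [(5, "e"), (6, "f"), (7, "g")],
}
prices = {"A": 2.0, "B": 2.0, "C": 2.0}
queries = [lambda r: r[0] >= 2]
chosen, gained = greedy_assemble(base, candidates, prices, queries, budget=4.0)
print(chosen, gained)
```

Every round recomputes `len(res - covered)` for each remaining candidate, so the work grows with the total number of tuples times the number of rounds. The ML-based estimator described in the abstract is meant to replace that exact set difference with a prediction.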
