Distinctiveness Maximization in Datasets Assemblage

Tingting Wang; Shixun Huang; Zhifeng Bao; J. Shane Culpepper; Volkan Dedeoglu; Reza Arablouei

TL;DR

This paper addresses budgeted dataset acquisition to maximize distinctiveness, i.e., the number of distinct tuples that the acquired datasets add to a base dataset across the results of a user's query set. It proves the problem NP-hard and presents two greedy approaches: Exact-Greedy, which attains a (1-1/e)/2 (approximately 0.316) approximation ratio but must scan every tuple of each candidate dataset, and ML-Greedy, which instead uses an ML-based estimator to predict the marginal gain of candidate datasets. The ML approach relies on a five-component pipeline: data summary generation, query-aware dataset embedding generation, query-set embedding generation, distinctiveness estimation, and merging of data summaries. Extensive experiments on five real-world data pools show that ML-Greedy outperforms all relevant baselines in effectiveness, efficiency, and scalability, and a case study on two downstream ML tasks shows improved task performance when using the assembled datasets.

Abstract

In this paper, given a user's query set and budget, we aim to use the limited budget to help users assemble a set of datasets that can enrich a base dataset by introducing the maximum number of distinct tuples (i.e., maximizing distinctiveness). We prove this problem to be NP-hard. A greedy algorithm using exact distinctiveness computation attains an approximation ratio of (1-1/e)/2, but it lacks efficiency and scalability due to its frequent computation of the exact distinctiveness marginal gain of any candidate dataset for selection. This requires scanning through every tuple in candidate datasets and thus is unaffordable in practice. To overcome this limitation, we propose an efficient machine learning (ML)-based method for estimating the distinctiveness marginal gain of any candidate dataset. This effectively eliminates the need to test each tuple individually. Estimating the distinctiveness marginal gain of a dataset involves estimating the number of distinct tuples in the tuple sets returned by each query in a query set across multiple datasets. This can be viewed as the cardinality estimation for a query set on a set of datasets, and the proposed method is the first to tackle this cardinality estimation problem. This is a significant advancement over prior methods that were limited to single-query cardinality estimation on a single dataset and struggled with identifying overlaps among tuple sets returned by each query in a query set across multiple datasets. Extensive experiments using five real-world data pools demonstrate that our algorithm, which utilizes ML-based distinctiveness estimation, outperforms all relevant baselines in effectiveness, efficiency, and scalability. A case study on two downstream ML tasks also highlights its potential to find datasets with more useful tuples to enhance the performance of ML tasks.
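The exact greedy selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `query_results` and `greedy_assemble`, the representation of queries as Python predicates, the per-dataset `prices`, and the gain-per-price selection rule are all assumptions made for illustration, and the sketch makes no claim about the (1-1/e)/2 guarantee. It does show why exact computation is costly: every candidate's query results are obtained by scanning all of its tuples.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

Row = Tuple[Hashable, ...]
Query = Callable[[Row], bool]  # assumption: a query is a predicate over one tuple


def query_results(rows: List[Row], queries: List[Query]) -> Set[Row]:
    """Distinct tuples returned by at least one query (scans every tuple)."""
    return {r for r in rows if any(q(r) for q in queries)}


def greedy_assemble(
    base: List[Row],
    pool: Dict[str, List[Row]],
    prices: Dict[str, float],  # assumption: every price is strictly positive
    queries: List[Query],
    budget: float,
) -> Tuple[List[str], int]:
    """Pick datasets by exact marginal gain per unit price until the budget runs out."""
    covered = query_results(base, queries)
    results = {name: query_results(rows, queries) for name, rows in pool.items()}
    chosen: List[str] = []
    remaining = budget
    candidates = set(pool)
    while candidates:
        best, best_ratio = None, 0.0
        for name in sorted(candidates):  # sorted for deterministic tie-breaking
            if prices[name] > remaining:
                continue  # unaffordable with the remaining budget
            gain = len(results[name] - covered)  # exact marginal gain
            ratio = gain / prices[name]
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:  # nothing affordable adds a new distinct tuple
            break
        chosen.append(best)
        covered |= results[best]
        remaining -= prices[best]
        candidates.remove(best)
    return chosen, len(covered)


# Toy data (hypothetical): tuples are (id, label) pairs.
base = [(1, "a"), (2, "b")]
pool = {
    "D1": [(2, "b"), (3, "c"), (4, "d")],
    "D2": [(3, "c"), (5, "e")],
    "D3": [(6, "f"), (7, "g"), (8, "h"), (9, "i")],
}
prices = {"D1": 2.0, "D2": 1.0, "D3": 4.0}
queries = [lambda r: r[0] % 2 == 0, lambda r: r[0] >= 5]

chosen, total = greedy_assemble(base, pool, prices, queries, budget=5.0)
print(chosen, total)
```

In this toy run, `D1` overlaps with the base dataset on `(2, "b")`, so its marginal gain is only one tuple; overlaps of this kind are what the ML-based estimator has to account for without scanning the tuples themselves.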

Paper Structure (23 sections, 2 theorems, 7 equations, 14 figures, 6 tables, 4 algorithms)

Introduction
Problem Formulation
Greedy Algorithm using Exact Distinctiveness Computation
Greedy Algorithm using ML-based Distinctiveness Estimation
Component 1: Data Summary Generation
Identifying column sets
Generating column set embeddings
Component 2: Query-aware Dataset Embedding Generation
Component 3: Query-set Embedding Generation
Component 4: Distinctiveness Estimation
Component 5: Merging Data Summaries
The Complete Algorithm
Preparing the Datasets and Queries
Datasets Preparation
Query Set Preparation
...and 8 more sections

Key Result

Theorem 1

The DM problem is NP-hard.

Figures (14)

Figure 1: Our data preparation pipeline with advanced datasets assemblage versus existing pipelines. The user input for each stage is shown by color (blue for basic datasets discovery, red for advanced datasets assemblage, and purple for tuples discovery).
Figure 2: An example of Fig. 1 where our pipeline achieves the same tuples discovery with a lower budget.
Figure 3: An example for MCE (red for the overlapping tuple).
Figure 4: The ML-based distinctiveness estimation method, where red arrows represent the process of estimating the distinctiveness of a dataset and green arrows represent the process of merging data summaries.
Figure 5: The impact of budget $B$ on the distinctiveness ratio of each algorithm. (line chart is not used since cases are independent)
...and 9 more figures

Theorems & Definitions (6)

Definition 1: Distinctiveness
Definition 2: Multi-dataset-query cardinality estimation (MCE)
Definition 3: Distinctiveness Maximization (DM)
Definition 4: Maximum Coverage (MC) Nagarajan
Theorem 1
Theorem 2
