Correlation Sketches for Approximate Join-Correlation Queries

Aécio Santos, Aline Bessa, Fernando Chirigati, Christopher Musco, Juliana Freire

TL;DR

This work introduces join-correlation queries to discover datasets that are both joinable on a common key and have attributes correlated with a query column. It presents Correlation Sketches, a hashing-based synopsis that can reconstruct a uniform sample of the join outcome without performing the full join, enabling fast, scalable correlation estimation across large collections. The authors derive confidence-interval bounds for correlation estimates via concentration inequalities and design risk-aware scoring functions to rank candidate datasets under uncertainty. Experimental results on synthetic and real datasets demonstrate accurate correlation estimates, effective ranking improvements, and interactive query times, highlighting practical impact for data discovery and augmentation tasks.

Abstract

The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column $Q$ and a join column $K_Q$ from a query table $\mathcal{T}_Q$, retrieve tables $\mathcal{T}_X$ in a dataset collection such that $\mathcal{T}_X$ is joinable with $\mathcal{T}_Q$ on $K_Q$ and there is a column $C \in \mathcal{T}_X$ such that $Q$ is correlated with $C$. A naïve approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between $Q$ and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings.

Paper Structure

This paper contains 48 sections, 2 theorems, 22 equations, 5 figures, 2 tables.

Key Result

Theorem 1

The set of paired numeric values $\langle x_k, y_k \rangle \in L_{X \bowtie Y}$ is a uniform random sample of the set of paired numeric values $\langle x_k, y_k \rangle \in \mathcal{T}_{X \bowtie Y}$.
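The idea behind Theorem 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`key_hash`, `build_sketch`, `join_sketches`, `pearson`) are hypothetical, BLAKE2b stands in for whatever hash function the paper uses, and the paper's pair of hashes $h(k)$ and $h_u(k)$ is simplified here to a single integer hash. Each sketch keeps the $n$ keys with the smallest hash values, together with their mean-aggregated numeric values. Because both tables use the same hash, the keys found in both sketches are the join keys with the smallest hashes, and so behave like a random sample of the join.

```python
import hashlib
from collections import defaultdict
from math import sqrt


def key_hash(key):
    """Hash a join key to a 64-bit integer (assumption: BLAKE2b as a stand-in hash)."""
    digest = hashlib.blake2b(str(key).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_sketch(rows, n):
    """Keep the n keys with the smallest hashes; repeated keys are mean-aggregated."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key_hash(key)].append(value)
    smallest = sorted(groups)[:n]
    return {h: sum(groups[h]) / len(groups[h]) for h in smallest}


def join_sketches(sketch_x, sketch_y):
    """Pair the values of keys whose hashes appear in both sketches."""
    common = sorted(sketch_x.keys() & sketch_y.keys())
    return [(sketch_x[h], sketch_y[h]) for h in common]


def pearson(pairs):
    """Pearson correlation of the paired sample; None if it is undefined."""
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


# Usage: two toy tables sharing the same keys, with Y a linear function of X.
table_x = [(k, float(k)) for k in range(100)]
table_y = [(k, 2.0 * k + 1.0) for k in range(100)]
sample = join_sketches(build_sketch(table_x, 20), build_sketch(table_y, 20))
print(len(sample), pearson(sample))
```

Only the two sketches are needed at query time, so the estimate is computed without materializing the full join. The number of paired values in `sample` depends on how many keys the two sketches share, which is why the sketch intersection size matters for accuracy.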

Figures (5)

  • Figure 1: Table $\mathcal{T}_{X \bowtie Y}$ is the join of the input tables $\mathcal{T}_X$ and $\mathcal{T}_Y$, aggregated using the mean function. Correlation Sketches efficiently reconstructs a sample of the table $\mathcal{T}_{X \bowtie Y}$ to estimate correlation between the columns $X_{X \bowtie Y}$ and $Y_{X \bowtie Y}$, without computing the full join.
  • Figure 2: The tables $L_{\langle K_X,X \rangle}$ and $L_{\langle K_Y,Y \rangle}$ represent correlation sketches for the tables $\mathcal{T}_X$ and $\mathcal{T}_Y$, for sketch size $n = 3$ and mean aggregation. While we explicitly show the column $h_u(k)$ for illustrative purposes, it does not need to be stored as it can be easily computed from $h(k)$.
  • Figure 3: Estimation errors vary significantly across sample sizes and across datasets with different data distributions. (a), (b), and (c) show the deviations of all column pairs for 3 different datasets. (d) shows the estimates from (c) after filtering out estimates that use fewer than 20 samples.
  • Figure 4: The sample size (sketch intersection) has an impact on RMSE. As the sketch intersection size increases, the RMSE decreases in the NYC dataset. Here, the $k$ parameter (row) denotes the maximum sketch size (number of minimum values kept in the sketch).
  • Figure 5: Distribution of the evaluation metric scores for different scoring functions. The $x$-axis shows slices of the metric range $[0,1]$. Each bar corresponds to a slice of width $0.1$. The $y$-axis shows the number of queries that fall in each slice.

Theorems & Definitions (5)

  • Definition 1: Join-Correlation Query
  • Definition 2: Join-Correlation Estimation
  • Theorem 1
  • Definition 3: Top-$k$ Join-Correlation Query
  • Lemma 1: Hoeffding's Inequality
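For reference, Hoeffding's inequality in its standard textbook form is given below. The exact statement and notation used in the paper's Lemma 1 may differ; the symbols $Z_i$, $a_i$, $b_i$, $S_m$, and $t$ are chosen here for illustration. Let $Z_1, \dots, Z_m$ be independent random variables with $a_i \le Z_i \le b_i$, and let $S_m = \sum_{i=1}^{m} Z_i$. Then for any $t > 0$:

$$
\Pr\left( \left| S_m - \mathbb{E}[S_m] \right| \ge t \right) \le 2 \exp\left( - \frac{2 t^2}{\sum_{i=1}^{m} (b_i - a_i)^2} \right)
$$

The bound tightens as the number of samples $m$ grows, which matches the observation in Figures 3 and 4 that estimates computed from larger sketch intersections have lower error.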