Accelerating Approximate Analytical Join Queries over Unstructured Data with Statistical Guarantees

Yuxuan Zhu, Tengjun Jin, Chenghao Mo, Daniel Kang

Abstract

Analytical join queries over unstructured data are increasingly prevalent in data analytics. Applying machine learning (ML) models to label every pair in the cross product of tables can achieve state-of-the-art accuracy, but the cost of pairwise execution of ML models is prohibitive. Existing algorithms, such as embedding-based blocking and sampling, aim to reduce this cost. However, they either fail to provide statistical guarantees (leading to errors up to 79% higher than expected) or become as inefficient as uniform sampling. We propose blocking-augmented sampling (BaS), which simultaneously achieves statistical guarantees and high efficiency. BaS optimally orchestrates embedding-based blocking and sampling to mitigate their respective limitations. Specifically, BaS allocates data tuples in the cross product into two regimes based on the failure modes of embeddings. In the regime of false negatives, BaS uses sampling to estimate the result. In the regime of false positives, BaS applies embedding-based blocking to improve efficiency. To minimize the estimation error given a budget for ML executions, we design a novel two-stage algorithm that adaptively allocates the budget between blocking and sampling. Theoretically, we prove that BaS asymptotically outperforms or matches standalone sampling. On real-world datasets across different modalities, we show that BaS provides valid confidence intervals and reduces estimation errors by up to 19$\times$, compared to state-of-the-art baselines.
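The split between the two regimes can be illustrated with a minimal sketch. This is not the authors' two-stage algorithm: the budget split (`block_size`) is fixed by the caller rather than chosen adaptively, the sampling is plain uniform sampling without replacement, and the interval is a normal approximation with a finite-population correction. The function name, its parameters, and the `scores` input (a mapping from pair to embedding similarity) are all hypothetical and introduced only for illustration.

```python
import math
import random


def blocked_plus_sampled_count(scores, oracle, budget, block_size, z=1.96, seed=0):
    """Estimate a COUNT over matching pairs with at most `budget` oracle calls.

    scores:     dict mapping pair -> embedding similarity (assumed input).
    oracle:     callable(pair) -> bool, standing in for the expensive ML model.
    block_size: how many top-similarity pairs to label exhaustively (assumed
                to be given; the paper allocates this adaptively).
    Returns (estimate, (lower, upper)).
    """
    if not 0 <= block_size <= budget:
        raise ValueError("block_size must lie between 0 and budget")

    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get, reverse=True)
    block, rest = ranked[:block_size], ranked[block_size:]

    # Blocking regime: every high-similarity pair is labeled, so this part
    # of the count is exact and contributes no sampling error.
    block_count = sum(1 for pair in block if oracle(pair))

    if not rest:
        return float(block_count), (float(block_count), float(block_count))

    # Sampling regime: the remaining pairs (where embeddings may produce
    # false negatives) are estimated from a uniform sample.
    n = min(budget - len(block), len(rest))
    if n == 0:
        raise ValueError("no budget left to sample the remaining pairs")
    sample = rng.sample(rest, n)
    hits = sum(1 for pair in sample if oracle(pair))

    big_n = len(rest)
    p_hat = hits / n
    estimate = block_count + big_n * p_hat

    # Normal-approximation standard error with finite-population correction.
    # Caveat: when hits == 0 this yields a zero-width interval, which is not
    # a valid guarantee; a real system would need a more careful bound.
    fpc = (big_n - n) / (big_n - 1) if big_n > 1 else 0.0
    se = big_n * math.sqrt(p_hat * (1.0 - p_hat) / n * fpc)

    lower = max(float(block_count), estimate - z * se)
    upper = estimate + z * se
    return estimate, (lower, upper)
```

Calling `blocked_plus_sampled_count(scores, oracle, budget=200, block_size=24)` labels the 24 most similar pairs exactly and spends the remaining 176 oracle calls on a uniform sample of the rest. When the budget covers the whole cross product, the sample is exhaustive and the interval collapses onto the true count.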

Paper Structure

This paper contains 35 sections, 7 theorems, 35 equations, 14 figures, 2 tables, 4 algorithms.

Key Result

Lemma 1

With probability higher than $p$, we can achieve the overall recall target $\gamma$ if $\gamma_s$ satisfies ... where ...

Figures (14)

  • Figure 1: Query syntax of JoinML.
  • Figure 2: Given a COUNT query with an Oracle budget of 5,000,000 and a probability of 95%, sampling algorithms lead to high estimation errors, while blocking fails to achieve statistical guarantees. Achieving statistical guarantees means that the 95th percentile of the true error is less than the error bound given by a 95% CI (to the left of the dashed line).
  • Figure 3: Illustration of Wander Join with indices, without indices, and with approximate indices. Black dots are data records while solid connections are viable paths during random walks. The widths of edges in (c) show the probability of choosing paths.
  • Figure 4: Cross product size and join selectivity of the 16 evaluated datasets: real-world, SemBench, and synthetic workloads.
  • Figure 5: BaS achieves valid statistical guarantees (Error Ratio $\le 1$) across all evaluated aggregators, whereas standard Blocking consistently produces invalid CIs.
  • ...and 9 more figures

Theorems & Definitions (7)

  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Lemma 2
  • Theorem 5