Poisson Sampling over Acyclic Joins

Liese Bekkers, Frank Neven, Lorrens Pantelis, Stijn Vansummeren

TL;DR

This work proposes an algorithm for Poisson sampling over acyclic joins that is nearly instance-optimal, running in time $O(N + k \log N)$, where $N$ is the size of the input database and $k$ is the size of the resulting sample.

Abstract

We introduce the problem of Poisson sampling over joins: compute a sample of the result of a join query by conceptually performing a Bernoulli trial for each join tuple, using a non-uniform and tuple-specific probability. We propose an algorithm for Poisson sampling over acyclic joins that is nearly instance-optimal, running in time $O(N + k \log N)$, where $N$ is the size of the input database and $k$ is the size of the resulting sample. Our algorithm hinges on two building blocks: (1) the construction of a random-access index that, given a number $i$, allows random access to the $i$-th join tuple without fully materializing the (possibly large) join result; (2) the probing of this index to construct the result sample. We study the engineering trade-offs required to make both components practical, focusing on their implementation in column stores, and identify the best-performing alternatives for both. Our experiments on real-world data demonstrate that this pair of alternatives significantly outperforms the repeated-Bernoulli-trial algorithm for Poisson sampling, and that the random-access index by itself can be used to competitively implement Yannakakis' acyclic join processing algorithm when no sampling is required. This shows that, as far as query engine design is concerned, it is possible to adopt a uniform basis for both classical acyclic join processing and Poisson sampling, both without regret compared to classical join and sampling algorithms.
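
To make the two building blocks concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a two-relation join R(x, y) ⋈ S(y, z) and a single uniform probability p for every join tuple, whereas the paper handles arbitrary acyclic joins and tuple-specific probabilities. The names `JoinIndex` and `poisson_sample_uniform` are hypothetical. The index stores prefix sums of per-tuple join counts so that the i-th join tuple can be located by binary search, and the sampler jumps between accepted positions with geometrically distributed skips instead of running one Bernoulli trial per join tuple.

```python
import bisect
import math
import random
from collections import defaultdict
from itertools import accumulate


class JoinIndex:
    """Random access into R(x, y) JOIN S(y, z) without materializing it.

    Hypothetical illustration: a two-relation special case only.
    """

    def __init__(self, r_rows, s_rows):
        # Group S by the join key y (order within a group follows S).
        self.s_groups = defaultdict(list)
        for y, z in s_rows:
            self.s_groups[y].append(z)
        # Keep only R tuples that have a join partner (semijoin reduction).
        self.r_rows = [(x, y) for x, y in r_rows if y in self.s_groups]
        # offsets[j] = number of join tuples produced by r_rows[0..j].
        weights = (len(self.s_groups[y]) for _, y in self.r_rows)
        self.offsets = list(accumulate(weights))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def get(self, i):
        """Return the i-th join tuple (0-based) via binary search."""
        j = bisect.bisect_right(self.offsets, i)  # R tuple owning position i
        start = self.offsets[j - 1] if j > 0 else 0
        x, y = self.r_rows[j]
        return (x, y, self.s_groups[y][i - start])


def poisson_sample_uniform(index, p, rng):
    """Keep each join tuple independently with the same probability p.

    Assumption: uniform p. The tuple-specific case needs more machinery.
    """
    n = len(index)
    if p <= 0.0 or n == 0:
        return []
    if p >= 1.0:
        return [index.get(i) for i in range(n)]
    sample = []
    i = -1
    log_q = math.log1p(-p)  # log(1 - p), negative
    while True:
        # The number of rejected tuples before the next accepted one is
        # geometric, so we can skip them without visiting each one.
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
        i += 1 + int(math.log(u) / log_q)
        if i >= n:
            return sample
        sample.append(index.get(i))


# Usage: sample roughly half of a small join result.
R = [(1, "a"), (2, "b"), (3, "c"), (4, "a")]
S = [("a", 10), ("a", 11), ("b", 20)]
idx = JoinIndex(R, S)
sample = poisson_sample_uniform(idx, 0.5, random.Random(0))
```

In this sketch, building the index takes time linear in the input, and each of the k probes costs one binary search. This mirrors the shape of the $O(N + k \log N)$ bound, though only for the simplified setting assumed here.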

Paper Structure

This paper contains 13 sections, 4 theorems, 10 equations, 10 figures, and 4 tables.

Key Result

proposition 1

For every acyclic join query $\hat{Q}$ and any attribute $y$ of $\hat{Q}$ we can compute an equivalent two-phase NSA expression $\mu^*(E)$ such that $y$ is a flat attribute in the output scheme of $E$.

Figures (10)

  • Figure 1: A join tree for the join query of Example \ref{ex:poisson-query}.
  • Figure 2: Illustration of nested relations and nested semijoins with $N_2 = (R(x,y,p) \ltimes_{\nu} S(u,a,x)) \ltimes_{\nu} T(v,y)$ and $N_1 = R(x,y,p) \ltimes_{\nu} S(u,a,x)$.
  • Figure 3: CSR grouping algorithm.
  • Figure 4: CSR random access.
  • Figure 5: USR Access
  • ...and 5 more figures

Theorems & Definitions (5)

  • proposition 1
  • definition 1
  • proposition 2
  • proposition 3
  • theorem 1