Ambidextrous Degree Sequence Bounds for Pessimistic Cardinality Estimation

Yu-Ting Lin, Hsin-Po Wang

TL;DR

This work addresses pessimistic cardinality estimation for join queries. It follows a three-step entropy framework that upper-bounds $H(X_1,\dots,X_n)$, and it extends the degree-sequence bounds used in that framework to bi-degree sequences and their bi-variate moments. It introduces ambidextrous bounds that replace single-edge terms with a bi-degree moment building block, proving the main inequality $p H(Y|X) + I(X;Y) + q H(X|Y) \le \ln {}_p\|R(A,B)\|_q$ and showing, via Hölder's inequality and convexity, that these bounds are never looser than the older dexterous bounds. The authors derive new bounds (e.g., a bound on $\#_\Delta^3$) using a Venn-diagram fractional-covering criterion, and show that convexity enables efficient optimization to identify tight bounds. Empirical evaluations on real-world SNAP graphs indicate that ambidextrous bounds are tighter than dexterous bounds: they overestimate by about $x^{3/4}$ times when the dexterous bounds overestimate by $x$ times, which matters for query planning and resource allocation in large databases. The work paves the way for incorporating sketching and spline-based approaches to balance speed and accuracy in bound discovery.
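The main inequality can be sanity-checked numerically on a toy relation. The sketch below is illustrative only: it assumes that ${}_p\|R\|_q$ equals the sum over edges $(a, b)$ of $\deg(a)^{p-1} \deg(b)^{q-1}$ (a reading inferred from the figure captions, not a definition quoted from the paper), it takes $(X, Y)$ uniform on the edges, and the helper names `bidegree_moment` and `lhs_uniform` are hypothetical.

```python
import math
from collections import Counter


def bidegree_moment(edges, p, q):
    """Assumed (p, q) bi-degree moment: sum over edges of deg(a)^(p-1) * deg(b)^(q-1)."""
    deg_a = Counter(a for a, _ in edges)
    deg_b = Counter(b for _, b in edges)
    return sum(deg_a[a] ** (p - 1) * deg_b[b] ** (q - 1) for a, b in edges)


def entropy(counts):
    """Shannon entropy (in nats) of the distribution proportional to `counts`."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts)


def lhs_uniform(edges, p, q):
    """p*H(Y|X) + I(X;Y) + q*H(X|Y) for (X, Y) uniform on the distinct edges."""
    h_xy = math.log(len(edges))
    h_x = entropy(Counter(a for a, _ in edges).values())
    h_y = entropy(Counter(b for _, b in edges).values())
    return p * (h_xy - h_x) + (h_x + h_y - h_xy) + q * (h_xy - h_y)


# Toy relation taken from the figure captions.
edges = [(1, 2), (3, 2), (3, 4)]
for p, q in [(1, 1), (3, 1), (1, 3), (2, 2)]:
    left = lhs_uniform(edges, p, q)
    right = math.log(bidegree_moment(edges, p, q))
    print(f"p={p}, q={q}: {left:.4f} <= {right:.4f}")
```

For $p = q = 1$ both sides reduce to $\ln 3$, the log of the relation size, so the inequality is tight there up to floating-point rounding.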

Abstract

In a large database system, upper-bounding the cardinality of a join query is a crucial task called $\textit{pessimistic cardinality estimation}$. Recently, Abo Khamis, Nakos, Olteanu, and Suciu unified related works into the following dexterous framework. Step 1: Let $(X_1, \dotsc, X_n)$ be a random row of the join, equating $H(X_1, \dotsc, X_n)$ to the log of the join cardinality. Step 2: Upper-bound $H(X_1, \dotsc, X_n)$ using Shannon-type inequalities such as $H(X, Y, Z) \le H(X) + H(Y|X) + H(Z|Y)$. Step 3: Upper-bound $H(X_i) + p H(X_j | X_i)$ using the $p$-norm of the degree sequence of the underlying graph of a relation. While the old bounds in step 3 count "claws $\in$" in the underlying graph, we propose $\textit{ambidextrous}$ bounds that count "claw pairs ${\ni}\!{-}\!{\in}$". The new bounds are provably not looser and empirically tighter: they overestimate by $x^{3/4}$ times when the old bounds overestimate by $x$ times. For example, when counting friend triples in the $\texttt{com-Youtube}$ dataset, the best dexterous bound is $1.2 \cdot 10^9$, the best ambidextrous bound is $5.1 \cdot 10^8$, and the actual cardinality is $1.8 \cdot 10^7$.
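The $\texttt{com-Youtube}$ numbers can be turned into overshoot factors with a few lines of arithmetic. The snippet below uses only the three figures quoted in the abstract; the variable names are illustrative. The $x^{3/4}$ relation is an empirical trend, so a single dataset should not be expected to match it exactly.

```python
# Figures quoted in the abstract for friend triples in com-Youtube.
dexterous_bound = 1.2e9
ambidextrous_bound = 5.1e8
actual_cardinality = 1.8e7

# Overshoot factor = bound / true cardinality.
old_overshoot = dexterous_bound / actual_cardinality
new_overshoot = ambidextrous_bound / actual_cardinality

# What the x^(3/4) trend would predict from the old overshoot.
trend_prediction = old_overshoot ** 0.75

print(f"dexterous overshoot:     {old_overshoot:.1f}x")
print(f"ambidextrous overshoot:  {new_overshoot:.1f}x")
print(f"x^(3/4) trend predicts:  {trend_prediction:.1f}x")
```

On this dataset the dexterous bound overshoots by roughly 67 times and the ambidextrous bound by roughly 28 times, while the trend applied to 67 gives roughly 23, so this data point sits close to the stated trend without matching it exactly.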

Paper Structure

This paper contains 15 sections, 7 theorems, 51 equations, 16 figures.

Key Result

Lemma 1

Let $X$, $Y$, and $Z$ be three random variables that may or may not correlate. Then $H(X, Y) + H(Y, Z) \geqslant H(X, Y, Z) + H(Y)$. A common paraphrase is $H(Z|Y) \geqslant H(Z | X, Y)$, wherein $H(Z|Y)$ denotes the conditional entropy and is defined to be $H(Y, Z) - H(Y)$. Another common paraphrase is $I(X; Z | Y) \geqslant 0$, wherein $I(X; Z | Y)$ denotes the conditional mutual information and is defined to be $H(Z|Y) - H(Z | X, Y)$.
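The equivalent forms of the lemma can be checked numerically. The following is a minimal sketch on an arbitrary randomly generated joint distribution; the helper `H` and the alphabet sizes are illustrative choices, not taken from the paper.

```python
import itertools
import math
import random


def H(joint, axes):
    """Shannon entropy (in nats) of the marginal of `joint` on the given coordinate axes."""
    marginal = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in axes)
        marginal[key] = marginal.get(key, 0.0) + prob
    return -sum(pr * math.log(pr) for pr in marginal.values() if pr > 0)


# Arbitrary joint pmf over (x, y, z) with alphabet sizes 2, 3, 2.
random.seed(0)
outcomes = list(itertools.product(range(2), range(3), range(2)))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {o: w / total for o, w in zip(outcomes, weights)}

X, Y, Z = 0, 1, 2
h_z_given_y = H(joint, (Y, Z)) - H(joint, (Y,))
h_z_given_xy = H(joint, (X, Y, Z)) - H(joint, (X, Y))
cond_mutual_info = h_z_given_y - h_z_given_xy  # I(X; Z | Y)

print("H(Z|Y)   =", h_z_given_y)
print("H(Z|X,Y) =", h_z_given_xy)
print("I(X;Z|Y) =", cond_mutual_info)
```

Conditioning on the extra variable $X$ cannot increase the entropy of $Z$, so the last printed value is nonnegative up to floating-point error.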

Figures (16)

  • Figure 1: Bounding $\#_\Delta$ can be viewed as a fractional covering problem on the entropy Venn diagram (leftmost). The building blocks shown correspond to \ref{p=0}, \ref{p=1}, \ref{p=oo}, and \ref{p=p}, respectively. Our contribution can be viewed as inventing a new building block, which corresponds to \ref{p1q}. This generates new bounds such as \ref{345}. See Lemma 4 (Venn criterion) to learn how to generate more. Note that this Venn diagram viewpoint is not valid for four or more variables, as the signs of the intersection information terms are not clear.
  • Figure 2: The number of quadruples $(a, b_1, b_2, b_3)$ such that $(a, b_1), (a, b_2), (a, b_3) \in \mathsf Z \coloneqq \{ (1, 2), (3, 2), (3, 4) \}$ is ${}_{3}\|\mathsf Z\|_{1} = 9$.
  • Figure 3: The number of quadruples $(a_1, a_2, a_3, b)$ such that $(a_1, b), (a_2, b), (a_3, b) \in \mathsf Z \coloneqq \{ (1, 2), (3, 2), (3, 4) \}$ is ${}_{1}\|\mathsf Z\|_{3} = 9$.
  • Figure 4: The number of pairs $(b, a)$ such that $(a, b)\in \mathsf Z \coloneqq \{ (1, 2), (3, 2), (3, 4) \}$ is ${}_{1}\|\mathsf Z\|_{1} = 3$.
  • Figure 5: The number of quadruples $(a_1, b, a, b_1)$ such that $(a, b), (a, b_1), (a_1, b) \in \mathsf Z \coloneqq \{ (1, 2), (3, 2), (3, 4) \}$ is ${}_{2}\|\mathsf Z\|_{2} = 8$.
  • ...and 11 more figures
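The counts in Figures 2 to 5 can be reproduced by brute-force enumeration. The sketch below assumes a general pattern read off from those captions: a center edge $(a, b)$, plus $p - 1$ extra neighbors of $a$ and $q - 1$ extra neighbors of $b$. The function name `count_claw_pairs` is hypothetical, and the generalization to arbitrary $(p, q)$ is an assumption rather than the paper's stated definition.

```python
from itertools import product


def count_claw_pairs(relation, p, q):
    """Count tuples made of a center edge (a, b), p-1 extra b's adjacent to a,
    and q-1 extra a's adjacent to b (repetitions allowed)."""
    a_values = {a for a, _ in relation}
    b_values = {b for _, b in relation}
    count = 0
    for a, b in relation:
        for extra_bs in product(b_values, repeat=p - 1):
            for extra_as in product(a_values, repeat=q - 1):
                if all((a, b2) in relation for b2 in extra_bs) and all(
                    (a2, b) in relation for a2 in extra_as
                ):
                    count += 1
    return count


Z = {(1, 2), (3, 2), (3, 4)}
results = {(p, q): count_claw_pairs(Z, p, q) for p, q in [(3, 1), (1, 3), (1, 1), (2, 2)]}
print(results)
# → {(3, 1): 9, (1, 3): 9, (1, 1): 3, (2, 2): 8}
```

Enumeration is exponential in $p + q$ and is only meant to confirm the small figure examples; for the same quantity on larger relations, summing $\deg(a)^{p-1} \deg(b)^{q-1}$ over the edges gives the same count in linear time.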

Theorems & Definitions (15)

  • Lemma 1: submodularity
  • proof
  • Proposition 2
  • proof
  • Theorem 3: main
  • proof
  • Lemma 4: Venn criterion
  • proof : Proof of necessity (i.e., the only if part, $\Rightarrow$)
  • proof : Proof of sufficiency (i.e., the if part, $\Leftarrow$)
  • Theorem 5: Hölder and convexity
  • ...and 5 more