Ambidextrous Degree Sequence Bounds for Pessimistic Cardinality Estimation
Yu-Ting Lin, Hsin-Po Wang
TL;DR
This work addresses pessimistic cardinality estimation for join queries by reframing the problem as bounding the entropy $H(X_1,\dots,X_n)$ within a three-step framework, and by extending the bounds, through bi-degree sequences, to bi-variate moments. It introduces ambidextrous bounds that replace single-edge terms with a bi-degree moment building block, proving the main inequality $p H(Y|X) + I(X;Y) + q H(X|Y) \le \ln_p \|R(A,B)\|_q$ and showing, by Hölder-convexity, that these bounds are never looser than the older dexterous bounds. The authors derive new bounds (e.g., a bound on $\#_\Delta^3$) using a Venn-diagram fractional-covering criterion, and demonstrate that convexity enables efficient optimization to identify tight bounds. Empirical evaluations on real-world SNAP graphs indicate that ambidextrous bounds provide significantly tighter estimates than dexterous bounds: when the dexterous bounds overestimate by a factor of $x$, the ambidextrous bounds overestimate by roughly $x^{3/4}$, which suggests practical gains for query planning and resource allocation in large databases. The work paves the way for incorporating sketching and spline-based approaches to balance speed and accuracy in bound discovery.
Abstract
In a large database system, upper-bounding the cardinality of a join query is a crucial task called $\textit{pessimistic cardinality estimation}$. Recently, Abo Khamis, Nakos, Olteanu, and Suciu unified related works into the following dexterous framework. Step 1: Let $(X_1, \dotsc, X_n)$ be a uniformly random row of the join, so that $H(X_1, \dotsc, X_n)$ equals the log of the join cardinality. Step 2: Upper-bound $H(X_1, \dotsc, X_n)$ using Shannon-type inequalities such as $H(X, Y, Z) \le H(X) + H(Y|X) + H(Z|Y)$. Step 3: Upper-bound $H(X_i) + p H(X_j | X_i)$ using the $p$-norm of the degree sequence of the underlying graph of a relation. While the old bounds in step 3 count "claws $\in$" in the underlying graph, we propose $\textit{ambidextrous}$ bounds that count "claw pairs ${\ni}\!{-}\!{\in}$". The new bounds are provably not looser and empirically tighter: they overestimate by a factor of $x^{3/4}$ when the old bounds overestimate by a factor of $x$. For example, when counting friend triples in the $\texttt{com-Youtube}$ dataset, the best dexterous bound is $1.2 \cdot 10^9$, the best ambidextrous bound is $5.1 \cdot 10^8$, and the actual cardinality is $1.8 \cdot 10^7$.
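To make steps 1 and 3 concrete, the following is a minimal Python sketch of the older, one-sided (dexterous) building block only; it does not implement the ambidextrous bound. It checks numerically that $H(X) + p H(Y|X) \le \ln \sum_a \deg(a)^p$ when $(X, Y)$ is a uniformly random row of a relation $R(A,B)$. The toy relation `R` and the helper names `degree_counts`, `log_power_sum`, and `entropy_side` are illustrative assumptions and do not come from the paper.

```python
import math
from collections import Counter

# Hypothetical toy relation R(A, B), stored as a set of distinct rows.
R = {(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)}


def degree_counts(rows):
    """Map each value a of attribute A to deg(a), the number of rows (a, b)."""
    return Counter(a for a, _ in rows)


def log_power_sum(rows, p):
    """ln(sum_a deg(a)^p), which equals p * ln of the p-norm of the degree sequence."""
    return math.log(sum(d ** p for d in degree_counts(rows).values()))


def entropy_side(rows, p):
    """H(X) + p * H(Y|X) in nats, for (X, Y) a uniformly random row."""
    n = len(rows)
    degs = degree_counts(rows).values()
    # P(X = a) = deg(a) / n, and Y given X = a is uniform over deg(a) values.
    h_x = -sum((d / n) * math.log(d / n) for d in degs)
    h_y_given_x = sum((d / n) * math.log(d) for d in degs)
    return h_x + p * h_y_given_x


for p in (1, 2, 3):
    lhs = entropy_side(R, p)
    rhs = log_power_sum(R, p)
    print(f"p={p}: H(X) + p*H(Y|X) = {lhs:.4f} <= ln(sum deg^p) = {rhs:.4f}")
```

For $p = 1$ the two sides coincide with $\ln |R|$, which is step 1 restricted to a single relation; for larger $p$ the gap reflects how uneven the degree sequence is.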
