Join Size Bounds using Lp-Norms on Degree Sequences
Mahmoud Abo Khamis, Vasileios Nakos, Dan Olteanu, Dan Suciu
TL;DR
This paper tackles join-size estimation by introducing upper bounds based on $\ell_p$-norms of degree sequences, derived from information-theoretic (entropy) inequalities. It generalizes prior bounds (AGM, PANDA), which use only relation sizes and max-degrees, to arbitrary $\ell_p$-norms, yielding significantly tighter bounds. The bounds come with a matching query evaluation algorithm and are computable in time exponential in the query size; the approach also yields a dual LP formulation for computing the bound. The key theoretical contributions connect $\ell_p$-norm statistics to entropic quantities, prove that the almost-entropic (entropic) and polymatroid bounds coincide under suitable cones, and show tightness for simple degree sequences via normal databases. Empirically, the $\ell_p$-norm bounds outperform traditional estimators on both cyclic and acyclic queries (SNAP and JOB benchmarks), suggesting practical impact for pessimistic cardinality estimation and query optimization. The work also clarifies its relation to prior methods and outlines a path to integrating these tighter bounds into real systems.
Abstract
Estimating the output size of a query is a fundamental yet longstanding problem in database query processing. Traditional cardinality estimators used by database systems can routinely underestimate the true output size by orders of magnitude, which leads to significant system performance penalties. Recently, upper bounds have been proposed that are based on information inequalities and incorporate sizes and max-degrees from input relations, yet their main benefit is limited to cyclic queries, because they degenerate to rather trivial formulas on acyclic queries. We introduce a significant extension of the upper bounds, by incorporating $\ell_p$-norms of the degree sequences of join attributes. Our bounds are significantly lower than previously known bounds, even when applied to acyclic queries. These bounds are also based on information theory, come with a matching query evaluation algorithm, are computable in time exponential in the query size, and are provably tight when all degrees are "simple".
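To make the role of $\ell_p$-norms concrete, the following minimal sketch computes degree sequences for a two-relation join $R(X,Y) \bowtie S(Y,Z)$ and compares three upper bounds on its size. It is an illustration under stated assumptions, not the paper's general method: the bounds used here are the elementary Hölder and Cauchy-Schwarz inequalities applied to $\sum_y \deg_R(y)\,\deg_S(y)$, and the helper names (`degree_sequence`, `lp_norm`, `join_size`) and the toy relations are hypothetical.

```python
from collections import Counter
import math


def degree_sequence(relation, attr_index):
    """Number of tuples per value of the chosen attribute, sorted descending."""
    counts = Counter(t[attr_index] for t in relation)
    return sorted(counts.values(), reverse=True)


def lp_norm(degrees, p):
    """l_p-norm of a degree sequence; p = math.inf gives the max degree."""
    if math.isinf(p):
        return float(max(degrees, default=0))
    return sum(d ** p for d in degrees) ** (1.0 / p)


def join_size(R, S):
    """Exact |R(X,Y) JOIN S(Y,Z)| = sum over y of deg_R(y) * deg_S(y)."""
    deg_r = Counter(y for _, y in R)
    deg_s = Counter(y for y, _ in S)
    return sum(c * deg_s[y] for y, c in deg_r.items())


# Toy data (an assumption for illustration): skewed degrees on Y.
R = [(x, y) for y in range(4) for x in range(y + 1)]   # deg_R(y) = y + 1
S = [(y, z) for y in range(4) for z in range(4 - y)]   # deg_S(y) = 4 - y

deg_r = degree_sequence(R, 1)   # Y is the second column of R
deg_s = degree_sequence(S, 0)   # Y is the first column of S

exact = join_size(R, S)
# Sizes only: |R| * |S|.
size_bound = len(R) * len(S)
# Size and max-degree: ||deg_R||_1 * ||deg_S||_inf (Hoelder with p=1, q=inf).
l1_linf_bound = lp_norm(deg_r, 1) * lp_norm(deg_s, math.inf)
# l_2-norms: ||deg_R||_2 * ||deg_S||_2 (Cauchy-Schwarz).
l2_bound = lp_norm(deg_r, 2) * lp_norm(deg_s, 2)

print(exact, l2_bound, l1_linf_bound, size_bound)
```

On this toy input the exact join size is 20, while the three bounds are roughly 30 ($\ell_2$), 40 (size times max-degree), and 100 (product of sizes). The join is acyclic, yet the $\ell_2$ statistic already gives a lower bound than sizes and max-degrees alone, which is the kind of gain the abstract describes.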
