Join Size Bounds using Lp-Norms on Degree Sequences

Mahmoud Abo Khamis, Vasileios Nakos, Dan Olteanu, Dan Suciu

TL;DR

This paper tackles join-size estimation by introducing upper bounds based on $\ell_p$-norms of degree sequences, grounded in information-theoretic entropy inequalities. It generalizes prior bounds (AGM, PANDA) to arbitrary $\ell_p$-norms, yielding significantly tighter bounds and a matching query evaluation algorithm; the bound itself is computable in time exponential in the query size, and the approach also yields a dual LP formulation for computing it. The key theoretical contributions connect $\ell_p$-norm statistics to entropic quantities, prove that the almost-entropic (entropic) and polymatroid bounds coincide under suitable cones, and show tightness for simple degree sequences via normal databases. Empirically, the $\ell_p$-norm bounds outperform traditional estimators on cyclic and acyclic queries (SNAP and JOB benchmarks), suggesting practical impact for pessimistic cardinality estimation and query optimization. The work also clarifies relations to prior methods and provides a pathway to integrating these tighter bounds into real systems.

Abstract

Estimating the output size of a query is a fundamental yet longstanding problem in database query processing. Traditional cardinality estimators used by database systems can routinely underestimate the true output size by orders of magnitude, which leads to significant performance penalties. Recently, upper bounds have been proposed that are based on information inequalities and incorporate sizes and max-degrees from input relations, yet their main benefit is limited to cyclic queries, because they degenerate to rather trivial formulas on acyclic queries. We introduce a significant extension of the upper bounds, by incorporating $\ell_p$-norms of the degree sequences of join attributes. Our bounds are significantly lower than previously known bounds, even when applied to acyclic queries. These bounds are also based on information theory: they come with a matching query evaluation algorithm, are computable in exponential time in the query size, and are provably tight when all degrees are "simple".
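To make the notion of an $\ell_p$-norm of a degree sequence concrete, the following is a minimal sketch for the simplest case, a two-relation join $R(x,y) \bowtie S(y,z)$. It is not the paper's general LP-based bound. It relies only on the standard Hölder (Cauchy-Schwarz) inequality: since the exact join size is $\sum_y \deg_R(y)\,\deg_S(y)$, it is at most $\|\deg_R\|_2 \cdot \|\deg_S\|_2$, and also at most $\|\deg_R\|_1 \cdot \|\deg_S\|_\infty$ (relation size times max-degree). The function names and the toy data are assumptions made for illustration.

```python
from collections import Counter
import math


def degree_sequence(relation, key_index):
    """Sorted degrees of the join attribute: tuples per distinct value."""
    counts = Counter(t[key_index] for t in relation)
    return sorted(counts.values(), reverse=True)


def lp_norm(degrees, p):
    """l_p-norm of a degree sequence; p = math.inf yields the max degree."""
    if p == math.inf:
        return float(max(degrees, default=0))
    return sum(d ** p for d in degrees) ** (1.0 / p)


def join_size(r, s):
    """Exact size of R(x, y) JOIN S(y, z) on the shared attribute y."""
    deg_r = Counter(y for _, y in r)
    deg_s = Counter(y for y, _ in s)
    return sum(deg_r[y] * deg_s[y] for y in deg_r)


# Hypothetical toy relations: value y occurs n times in each relation.
R = [(x, y) for y, n in enumerate([4, 2, 1, 1]) for x in range(n)]
S = [(y, z) for y, n in enumerate([3, 3, 1, 1]) for z in range(n)]

deg_R = degree_sequence(R, key_index=1)  # degrees of y in R
deg_S = degree_sequence(S, key_index=0)  # degrees of y in S

bounds = {
    # Sizes only: |R| * |S|.
    "l1 * l1": lp_norm(deg_R, 1) * lp_norm(deg_S, 1),
    # Size of R times max-degree of S.
    "l1 * linf": lp_norm(deg_R, 1) * lp_norm(deg_S, math.inf),
    # Cauchy-Schwarz on the two degree sequences.
    "l2 * l2": lp_norm(deg_R, 2) * lp_norm(deg_S, 2),
}

true_size = join_size(R, S)
for name, value in bounds.items():
    print(f"{name:10s} bound = {value:7.2f}  (true size = {true_size})")
```

On this toy instance the $\ell_2$-based bound is the tightest of the three, even though the query is acyclic, which is the regime the abstract says size and max-degree bounds handle poorly.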

Paper Structure

This paper contains 39 sections, 19 theorems, 140 equations, 2 figures.

Key Result

Theorem 1.1

Let $Q$ be a full conjunctive query eq:full:cq, $\bm U_i, \bm V_i\subseteq \bm X$ be sets of variables, for $i\in[s]$, and suppose that the following information inequality is valid for all entropic vectors $\bm h$ with variables $\bm X$: where $w_i \geq 0$, and $p_i \in (0,\infty]$, for all $i\in[s]$. Assume that each conditional $(\bm V_i|\bm U_i)$ in eq:ii:lp is guarded by some relation $R_{j_

Figures (2)

  • Figure 1: Ratios of various bounds and estimates to the true cardinality of the query output for each of the 33 join queries in the JOB benchmark. Queries 29 and 31 were not computable by DuckDB due to their large output size.
  • Figure 2: A lattice of closed sets and the polymatroid from ZhangY98 defined on the lattice.

Theorems & Definitions (44)

  • Theorem 1.1
  • Theorem 1.2: Informal
  • Example 2.1
  • Example 2.2
  • Example 2.3
  • Lemma 2.4
  • proof
  • Lemma 2.5
  • proof
  • Theorem 2.6
  • ...and 34 more