Pessimistic Cardinality Estimation

Mahmoud Abo Khamis, Kyle Deeds, Dan Olteanu, Dan Suciu

TL;DR

Pessimistic Cardinality Estimation (PCE) addresses the problem of bounding query output sizes without full computation by providing guaranteed upper bounds instead of point estimates. The paper surveys a spectrum of PCE methods, grounded in degree sequences and information-theoretic inequalities, including the AGM bound, Chain Bound, Polymatroid Bound (PolyB), and Degree Sequence Bound (DSB), and explains how these bounds can be computed, combined, and compressed for practicality. It discusses practical considerations such as statistics selection, offline computation, conditional statistics, histograms, and handling of boolean predicates, as well as the tradeoffs between bound tightness, computation time, and compositionality. The work highlights the safety and composability advantages of PCE over traditional and ML-based estimators, while also outlining open questions about empirical evaluation, incremental updates, and applicability to cyclic queries. Overall, PCE provides a theoretically grounded, modular framework for safe cardinality bounding with potential to influence query optimization and resource planning.

Abstract

Cardinality estimation is the task of estimating the size of a query's output without computing it, using only statistics on the input relations. Existing estimators try to return an unbiased estimate of the cardinality, which is notoriously difficult. A new class of estimators, called "pessimistic estimators", has been proposed recently; these compute a guaranteed upper bound on the query output. Two recent advances have made pessimistic estimators practical. The first is the observation that degree sequences of the input relations can be used to compute query upper bounds. The second is a long line of theoretical results that developed the use of information-theoretic inequalities for query upper bounds. This paper is a short overview of pessimistic cardinality estimators, contrasting them with traditional estimators.
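
To make the degree-sequence idea concrete, here is a minimal sketch for a single two-way join. It is not taken from the paper: the toy relations `R` and `S` and the helper names `degree_sequence`, `join_size_upper_bound`, and `true_join_size` are hypothetical and chosen only for illustration. The sketch relies on the rearrangement inequality: the true join size is the sum over join values of the product of their degrees in the two relations, and that sum can only grow if the two degree sequences are both sorted in non-increasing order and multiplied position by position.

```python
from collections import Counter


def degree_sequence(rows, col):
    """Return the degrees of the values in column `col`, sorted non-increasingly."""
    counts = Counter(row[col] for row in rows)
    return sorted(counts.values(), reverse=True)


def join_size_upper_bound(deg_r, deg_s):
    """Upper bound on the join size, using only the two sorted degree sequences.

    Pairs the i-th largest degree of one relation with the i-th largest degree
    of the other; zip() stops at the shorter sequence, which is correct because
    the missing degrees are zero.
    """
    return sum(a * b for a, b in zip(deg_r, deg_s))


def true_join_size(r, r_col, s, s_col):
    """Exact equi-join size, computed from per-value counts (for comparison only)."""
    count_r = Counter(row[r_col] for row in r)
    count_s = Counter(row[s_col] for row in s)
    return sum(n * count_s[value] for value, n in count_r.items())


# Hypothetical toy relations: R(x, y) and S(y, z), joined on y.
R = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "c")]
S = [("a", 10), ("b", 20), ("b", 30), ("c", 40)]

deg_r = degree_sequence(R, 1)   # degrees of y in R
deg_s = degree_sequence(S, 0)   # degrees of y in S

bound = join_size_upper_bound(deg_r, deg_s)
exact = true_join_size(R, 1, S, 0)
print(f"upper bound = {bound}, exact join size = {exact}")
```

On this toy input the bound is loose because the most frequent value in `R` is not the most frequent value in `S`; the estimator cannot know that from the degree sequences alone, so it assumes the worst-case alignment.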

Paper Structure

This paper contains 14 sections, 1 theorem, 39 equations, 3 figures.

Key Result

Lemma C.1

Let $\bm a, \bm b$ be two non-negative sequences, and assume that $\bm b$ is non-increasing (that is, $b_1 \geq b_2 \geq \cdots$). Let $\bm A \stackrel{\text{def}}{=} \Sigma \bm a$, and let $\bm A'$ be such that $\bm A \leq \bm A'$. Then, the following holds, where $\bm a' \stackrel{\text{def}}{=}

Figures (3)

  • Figure 1: Example of Degree Sequences
  • Figure 2: The degree sequence of $\text{CastInfo}$ from the JOB benchmark (Leis et al., PVLDB 2015). Its cardinality is $36\cdot 10^6$, and it has $4\cdot 10^6$ distinct actor IDs, with degrees ranging from $10^4$ to $1$. The maximum degree is that of Bob Barker, who hosted the CBS show The Price Is Right from 1972 to 2007 and also Truth or Consequences from 1956 to 1975.
  • Figure 3: Illustration of the advanced compression in DSB: it shows the PDFs $\bm a$, $\bm a'$, $\bm a''$ and their CDFs $\bm A$, $\bm A'$, $\bm A''$ (see text).

Theorems & Definitions (1)

  • Lemma C.1