Table of Contents
Fetching ...

Is it Bigger than a Breadbox: Efficient Cardinality Estimation for Real World Workloads

Zixuan Yi, Sami Abu-el-Haija, Yawen Wang, Teja Vemparala, Yannis Chronis, Yu Gan, Michael Burrows, Carsten Binnig, Bryan Perozzi, Ryan Marcus, Fatma Ozcan

TL;DR

The paper tackles efficient cardinality estimation for real-world workloads by introducing LiteCard, an online, graph-pattern–localized estimator. It learns many small, pattern-specific models online, organized in a three-level hierarchy and keyed by hashed query-graph patterns, enabling cold-start operation with negligible update and inference overhead. Empirical results on IMDb JOB-lite show LiteCard delivering a $7.5$ minute speedup (over $30\%$) with minimal overhead, and a substantial reduction in End-to-End time and Q-Error relative to PostgreSQL and other learned estimators, while maintaining scalability as joins increase. Integrating LiteCard into PostgreSQL via planner hooks demonstrates practical viability and highlights a favorable Pareto frontier for learned cardinality estimators in real systems.

Abstract

DB engines produce efficient query execution plans by relying on cost models. Practical implementations estimate cardinality of queries using heuristics, with magic numbers tuned to improve average performance on benchmarks. Empirically, estimation error significantly grows with query complexity. Alternatively, learning-based estimators offer improved accuracy, but add operational complexity preventing their adoption in-practice. Recognizing that query workloads contain highly repetitive subquery patterns, we learn many simple regressors online, each localized to a pattern. The regressor corresponding to a pattern can be randomly-accessed using hash of graph structure of the subquery. Our method has negligible overhead and competes with SoTA learning-based approaches on error metrics. Further, amending PostgreSQL with our method achieves notable accuracy and runtime improvements over traditional methods and drastically reduces operational costs compared to other learned cardinality estimators, thereby offering the most practical and efficient solution on the Pareto frontier. Concretely, simulating JOB-lite workload on IMDb speeds-up execution by 7.5 minutes (>30%) while incurring only 37 seconds overhead for online learning.

Is it Bigger than a Breadbox: Efficient Cardinality Estimation for Real World Workloads

TL;DR

The paper tackles efficient cardinality estimation for real-world workloads by introducing LiteCard, an online, graph-pattern–localized estimator. It learns many small, pattern-specific models online, organized in a three-level hierarchy and keyed by hashed query-graph patterns, enabling cold-start operation with negligible update and inference overhead. Empirical results on IMDb JOB-lite show LiteCard delivering a minute speedup (over ) with minimal overhead, and a substantial reduction in End-to-End time and Q-Error relative to PostgreSQL and other learned estimators, while maintaining scalability as joins increase. Integrating LiteCard into PostgreSQL via planner hooks demonstrates practical viability and highlights a favorable Pareto frontier for learned cardinality estimators in real systems.

Abstract

DB engines produce efficient query execution plans by relying on cost models. Practical implementations estimate cardinality of queries using heuristics, with magic numbers tuned to improve average performance on benchmarks. Empirically, estimation error significantly grows with query complexity. Alternatively, learning-based estimators offer improved accuracy, but add operational complexity preventing their adoption in-practice. Recognizing that query workloads contain highly repetitive subquery patterns, we learn many simple regressors online, each localized to a pattern. The regressor corresponding to a pattern can be randomly-accessed using hash of graph structure of the subquery. Our method has negligible overhead and competes with SoTA learning-based approaches on error metrics. Further, amending PostgreSQL with our method achieves notable accuracy and runtime improvements over traditional methods and drastically reduces operational costs compared to other learned cardinality estimators, thereby offering the most practical and efficient solution on the Pareto frontier. Concretely, simulating JOB-lite workload on IMDb speeds-up execution by 7.5 minutes (>30%) while incurring only 37 seconds overhead for online learning.

Paper Structure

This paper contains 32 sections, 3 theorems, 29 equations, 13 figures, 6 tables.

Key Result

Theorem 1

Any feature set $\mathcal{H} \subseteq \mathcal{A}$ can induce a canonical node ordering. Specifically, such that $\pi^\mathcal{H}(\mathcal{G})$ and $\pi^\mathcal{H}(\mathcal{G}')$ can be used to align the featured DAGs, and sparse re-ordering (adjacency) matrix $\mathbf{A}^{\pi^\mathcal{H}(\mathcal{G})} \in \{0, 1\}^{n \times n}$ shuffles rows of its multiplicand according to ordering defined by

Figures (13)

  • Figure 1: DAG corresponding to SQL in Eq. \ref{['eq:sample_sql']}
  • Figure 2: t-SNE visualizations of IMDB 5K workload. (Left) Every subquery is a point (with 5% opacity). Due to $K^\mathcal{H}_\mathcal{F}(G, G') = \mathbf{1}_{[h^\mathcal{H}(G) = h^\mathcal{H}(G')]} \times .$, subquery DAGs that are isomorphic (per $\mathcal{H})$ are cleanly clustered, painting a darker region. The point color represents cardinality of the query (from red to blue). We choose 6 clusters (by stratified sampling) and circle them with colors. (Right) we recompute t-SNE within each colored cluster. The original dimension of every right plot equals the number of @ nodes in the graph above it, which renders the subquery pattern graph. Finally, points are colored using their ground-truth (normalized) cardinalities.
  • Figure 3: Comparing different techniques on the IMDb database on multiple metrics. Lower and to the left is better. Note the x-axis log scale.
  • Figure 4: Query Optimization Time Comparison per query on the IMDb dataset. Note the log scale on the Y-Axis.
  • Figure 5: Comulative Q-Error percentile on the IMDb workload VS size of set $\mathcal{D}^\mathcal{H}_\mathcal{G}$ (§ \ref{['sec:online_algorithm']})
  • ...and 8 more figures

Theorems & Definitions (6)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1
  • Theorem 2
  • Theorem 3