CUBE: A Cardinality Estimator Based on Neural CDF

Xiao Yan, Tiezheng Nie, Boyang Fang, Derong Shen, Kou Yue, Yu Ge

TL;DR

This work tackles the challenge of accurate, fast, and stable cardinality estimation for database queries. It introduces CUBE, a CDF-based estimator that constructs a tractable multivariate CDF from univariate CDFs and a mixing tensor, enabling range-query cardinalities to be computed without sampling or integration. The framework supports single-table queries and multi-table joins, with inference acceleration via merged calculations and a global join-based model, and it offers formal predictability guarantees (monotonicity, validity, consistency, stability). Empirical results show CUBE achieves higher accuracy and significantly lower latency than state-of-the-art data-driven estimators, including strong tail performance and excellent scalability to high dimensions. The work suggests promising directions for hybrid training and GPU-accelerated deployment in practical DB systems.

Abstract

Modern database optimizers rely on cardinality estimators, whose accuracy directly affects the optimizer's ability to choose an optimal execution plan. Recent data-driven methods leverage probabilistic models to achieve higher estimation accuracy, but they cannot guarantee low inference latency at the same time, and they neglect scalability. As data dimensionality grows, optimization time can even exceed actual query execution time. Furthermore, inference with probabilistic models through sampling or integration produces unpredictable estimation results and violates stability, which leads to unstable query execution performance and makes database tuning hard for users. In this paper, we propose a novel approach to cardinality estimation based on the cumulative distribution function (CDF), which calculates range-query cardinality without sampling or integration, ensuring accurate and predictable estimation results. With inference accelerated by merging calculations, our approach achieves fast and nearly constant inference speed while maintaining high accuracy, even as dimensionality increases, and is over 10x faster than current state-of-the-art data-driven cardinality estimators. This demonstrates its excellent dimensional scalability, making it well-suited for real-world database applications.
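To make the idea of computing a range-query cardinality from a CDF without sampling or integration concrete, here is a minimal Python sketch. It uses the textbook inclusion-exclusion identity over the corners of the query box, not CUBE's own model or its merged-calculation optimization. The names `box_probability`, `toy_cdf`, and `estimate_cardinality` are hypothetical, and the toy joint CDF assumes independent Uniform(0, 1) columns purely for illustration.

```python
from itertools import product
from typing import Callable, Sequence


def box_probability(cdf: Callable[[Sequence[float]], float],
                    lower: Sequence[float],
                    upper: Sequence[float]) -> float:
    """P(lower < X <= upper) by inclusion-exclusion over the 2^d box corners."""
    d = len(lower)
    total = 0.0
    for s in product((0, 1), repeat=d):
        # s[i] == 1 selects the lower bound in dimension i, 0 the upper bound.
        corner = [lower[i] if s[i] else upper[i] for i in range(d)]
        # Each selected lower bound flips the sign of the term.
        sign = -1.0 if sum(s) % 2 else 1.0
        total += sign * cdf(corner)
    return total


def toy_cdf(x: Sequence[float]) -> float:
    """Hypothetical joint CDF: independent Uniform(0, 1) columns (an assumption)."""
    p = 1.0
    for xi in x:
        p *= min(max(xi, 0.0), 1.0)
    return p


def estimate_cardinality(cdf: Callable[[Sequence[float]], float],
                         lower: Sequence[float],
                         upper: Sequence[float],
                         n_rows: int) -> float:
    """Scale the box probability by the table size to get a row-count estimate."""
    return n_rows * box_probability(cdf, lower, upper)


if __name__ == "__main__":
    # Query 0.2 < X1 <= 0.7 AND 0.1 < X2 <= 0.5 on a 1000-row toy table.
    print(estimate_cardinality(toy_cdf, [0.2, 0.1], [0.7, 0.5], 1000))
```

This naive loop evaluates the CDF at all $2^d$ corners, so its cost grows exponentially with the number of queried dimensions. The result is also deterministic: the same query always yields the same estimate, in contrast with sampling-based inference.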

Paper Structure

This paper contains 30 sections, 7 theorems, 28 equations, 11 figures, 5 tables.

Key Result

theorem 1

Given a query region $\Omega = [l_1, u_1] \times \cdots \times [l_d, u_d]$ on a random variable $\mathbf{X} = (X_1, X_2, \cdots, X_d)$ with CDF denoted as $F(\mathbf x) = \Pr(X_1\le x_1,\, X_2\le x_2,\ \ldots,\ X_d\le x_d)$. Let $\mathbf s=(s_1,s_2,\dots,s_d)$, $|\mathbf s| = s_1+s_2+\dots+s
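For intuition, the two-dimensional case (the setting of Figure 2) of the textbook inclusion-exclusion identity for rectangle probabilities can be written out explicitly. The theorem statement above is truncated in this copy, so the following is the standard identity and not a reconstruction of the paper's exact wording. It assumes a continuous distribution, so that the boundary of $\Omega$ carries no probability mass, and $N$ is a hypothetical symbol for the table's row count.

$$
\Pr(\mathbf{X} \in \Omega) = F(u_1, u_2) - F(l_1, u_2) - F(u_1, l_2) + F(l_1, l_2), \qquad \Omega = [l_1, u_1] \times [l_2, u_2].
$$

Under these assumptions the estimated cardinality of the range query is $N \cdot \Pr(\mathbf{X} \in \Omega)$, obtained from four CDF evaluations with no sampling or numerical integration.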

Figures (11)

  • Figure 1: Multivariate CDF obtained by combining univariate CDFs.
  • Figure 2: CardEst with CDF on 2-dimensional data.
  • Figure 3: Continuity correction on 1-dimensional data.
  • Figure 4: (a) Inference procedure before optimization; (b) inference procedure after merged calculation.
  • Figure 5: Global model for multi-table CardEst, built on the full outer join of tables $A,B,C,D,E,F$, with a subset query involving tables $A,E,F$.
  • ...and 6 more figures

Theorems & Definitions (7)

  • theorem 1
  • corollary 1
  • theorem 2
  • corollary 2
  • theorem 3
  • theorem 4
  • theorem 5