CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases
Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan
TL;DR
CardBench addresses the need for a diverse, large-scale benchmark for learned cardinality estimation in relational databases. It provides thousands of queries across 20 real-world datasets and two training data configurations, plus an open-source pipeline for collecting statistics, generating queries, and producing annotated query graphs. The study evaluates Graph Neural Network (GNN) and Graph Transformer approaches under instance-based, zero-shot, and fine-tuned regimes. It finds that zero-shot generalization is challenging for queries with joins, but that fine-tuning with modest amounts of data can reach accuracy comparable to instance-based models while reducing training overhead. By providing extensive datasets and tooling, CardBench enables systematic progress in pre-trained cardinality estimation (CE) methods and invites the ML and DB communities to extend it to more complex workloads.
Abstract
Cardinality estimation is crucial for enabling high query performance in relational databases. Recently, learned cardinality estimation models have been proposed to improve accuracy, but there is no systematic benchmark or dataset that allows researchers to evaluate the progress made by new learned approaches, or even to develop new learned approaches systematically. In this paper, we release a benchmark for learned cardinality estimation containing thousands of queries over 20 distinct real-world databases. In contrast to other initial benchmarks, our benchmark is much more diverse and can be used to train and test learned models systematically. Using this benchmark, we explored whether learned cardinality estimation can be transferred to an unseen dataset in a zero-shot manner. We trained GNN-based and transformer-based models to study the problem in three setups: (1) instance-based, (2) zero-shot, and (3) fine-tuned. Our results show that while zero-shot cardinality estimation is promising on simple single-table queries, accuracy drops as soon as we add joins. However, we show that with fine-tuning we can still use pre-trained models for cardinality estimation, significantly reducing training overhead compared to instance-specific models. We are open-sourcing our scripts for collecting statistics and generating queries and training datasets to foster more extensive research, including from the ML community, on the important problem of cardinality estimation, and in particular to improve on recent directions such as pre-trained cardinality estimation.
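To make the three setups concrete, the following is a minimal sketch of how they differ in where the training queries come from. It is an illustration only, not the benchmark's actual tooling. The names `Regime`, `build_regimes`, `pretrain_on`, and `train_on_target`, as well as the database names in the usage example, are hypothetical. It assumes that an instance-based model is trained only on queries from the target database, a zero-shot model only on queries from the other databases, and a fine-tuned model on the other databases followed by some queries from the target.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Regime:
    """Describes which databases supply training queries for one setup (hypothetical type)."""

    name: str
    # Databases whose queries are used for pre-training (empty if there is none).
    pretrain_on: list[str] = field(default_factory=list)
    # Whether any queries from the target database are used for training.
    train_on_target: bool = False


def build_regimes(databases: list[str], target: str) -> dict[str, Regime]:
    """Return the three setups for evaluating a model on `target`.

    Assumption: every setup is tested on queries from `target`; the setups
    differ only in which databases contribute training queries.
    """
    if target not in databases:
        raise ValueError(f"unknown target database: {target!r}")

    # All databases except the one the model is evaluated on.
    others = [db for db in databases if db != target]

    return {
        # Trained from scratch on the target database only.
        "instance-based": Regime("instance-based", train_on_target=True),
        # Pre-trained on the other databases; the target stays unseen.
        "zero-shot": Regime("zero-shot", pretrain_on=others),
        # Pre-trained on the other databases, then adapted on the target.
        "fine-tuned": Regime("fine-tuned", pretrain_on=others, train_on_target=True),
    }


if __name__ == "__main__":
    # Placeholder database names, not taken from the benchmark.
    regimes = build_regimes(["db_a", "db_b", "db_c"], target="db_b")
    for regime in regimes.values():
        print(regime)
```

Under these assumptions, the zero-shot and fine-tuned setups share the same pre-training databases and differ only in whether queries from the target database are seen during training, which is why a fine-tuned model can reuse a pre-trained one instead of being trained from scratch.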
