ZeroCard: Cardinality Estimation with Zero Dependence on Target Databases -- No Data, No Query, No Retraining

Xianghong Xu, Rong Kang, Xiao He, Lei Zhang, Jianjun Chen, Tieying Zhang

TL;DR

ZeroCard addresses the practical limitations of data- and log-dependent learned cardinality estimators by proposing a semantics-driven approach that uses schema semantics and a template-agnostic predicate representation. It is pretrained on a large, semantically annotated tabular corpus and deployed off-the-shelf with frozen parameters on unseen databases, eliminating data access and retraining. The architecture combines semantics embeddings via PLMs, a mixture-of-experts distribution predictor, and unified predicate representations to predict distributions and cardinalities, achieving zero-dependence deployment and strong practical efficiency for tasks like index recommendation. While showing competitive performance and substantial deployment advantages, it remains limited to single-table queries with up to eight predicates and relies on accurate semantic alignment between schema and data. The work opens avenues for extending to joins and richer operators and motivates broader adoption of semantics-driven cardinality estimation in real DBMSs.

Abstract

Cardinality estimation is a fundamental task in database systems and plays a critical role in query optimization. Despite significant advances in learning-based cardinality estimation methods, most existing approaches remain difficult to generalize to new datasets due to their strong dependence on raw data or queries, thus limiting their practicality in real scenarios. To overcome these challenges, we argue that semantics in the schema may benefit cardinality estimation, and leveraging such semantics may alleviate these dependencies. To this end, we introduce ZeroCard, the first semantics-driven cardinality estimation method that can be applied without any dependence on raw data access, query logs, or retraining on the target database. Specifically, we propose to predict data distributions using schema semantics, thereby avoiding raw data dependence. Then, we introduce a query template-agnostic representation method to alleviate query dependence. Finally, we construct a large-scale query dataset derived from real-world tables and pretrain ZeroCard on it, enabling it to learn cardinality from schema semantics and predicate representations. After pretraining, ZeroCard's parameters can be frozen and applied in an off-the-shelf manner. We conduct extensive experiments to demonstrate the distinct advantages of ZeroCard and show its practical applications in query optimization. Its zero-dependence property significantly facilitates deployment in real-world scenarios.

Paper Structure

This paper contains 29 sections, 12 equations, 6 figures, 3 tables, 2 algorithms.

Figures (6)

  • Figure 1: Architecture of ZeroCard. Once pretrained on a large-scale dataset, ZeroCard can be applied directly to unseen databases, requiring no raw data, no query logs, and no retraining.
  • Figure 2: Structure of distribution prediction MoE layer with $k=2$. The gating network dynamically activates specific experts, each individually modeling different distributions.
  • Figure 3: Ablation study: performance of ZeroCard and its variants.
  • Figure 4: Performance of ZeroCard under different hyperparameter settings, where $h$ is the data distribution dimension size and $m$ is the number of experts.
  • Figure 5: Performance of query-driven methods across training epochs, compared with ZeroCard.
  • ...and 1 more figure
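
The Figure 2 caption describes a gating network that activates $k=2$ experts, each modeling a different distribution. The sketch below illustrates that general idea of top-k gating in plain Python; it is not ZeroCard's implementation. The linear gate, the softmax renormalization over the selected experts, the toy experts, and all names (`top_k_gate`, `moe_forward`, `gate_weights`) are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(xs: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits: Sequence[float], k: int = 2) -> Tuple[List[int], Vector]:
    """Pick the k experts with the largest gate logits.

    Assumption: weights are a softmax over the selected logits only,
    so they sum to 1 across the active experts.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return top, softmax([logits[i] for i in top])


def moe_forward(
    x: Sequence[float],
    gate_weights: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    k: int = 2,
) -> Tuple[Vector, List[int]]:
    """Mix the outputs of the k active experts for input x."""
    # Hypothetical linear gate: one score per expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    active, weights = top_k_gate(logits, k)
    outputs = [experts[i](x) for i in active]  # only active experts run
    mixed = [0.0] * len(outputs[0])
    for w, y in zip(weights, outputs):
        for j, yj in enumerate(y):
            mixed[j] += w * yj
    return mixed, active


# Toy experts: each returns a fixed 4-bucket distribution (hypothetical).
toy_experts: List[Expert] = [
    lambda x: [1.0, 0.0, 0.0, 0.0],      # mass concentrated in one bucket
    lambda x: [0.25, 0.25, 0.25, 0.25],  # uniform
    lambda x: [0.0, 0.0, 0.5, 0.5],      # mass in the upper buckets
]
toy_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

predicted, active_experts = moe_forward([2.0, 1.0], toy_gate, toy_experts, k=2)
```

Because each toy expert outputs a distribution and the gate weights sum to 1, the mixed output is itself a valid distribution over the four buckets. The third expert is never evaluated for this input, which is the sparsity that top-k gating provides.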