ACE: A Cardinality Estimator for Set-Valued Queries

Yufan Sheng, Xin Cao, Kaiqi Zhao, Yixiang Fang, Jianzhong Qi, Wenjie Zhang, Christian S. Jensen

TL;DR

ACE is an attention-based cardinality estimator for queries over set-valued data. It targets the limitations of existing estimators, which either favor high-frequency elements or assume partial independence. ACE combines three components: a distillation-based data encoder that compresses the dataset into a compact matrix, a query analyzer built on cross- and self-attention, and an attention pooling module that produces a fixed-size representation, which a regression head maps to a cardinality estimate. Two offline training losses, $L_{CE}$ and $L_{MMD}$, let the encoder preserve graph-like element-set relations, while the analyzer uses both data and workload information to capture element correlations. Experiments on real datasets show that ACE achieves substantially lower $Q$-errors and faster end-to-end runtimes than state-of-the-art baselines, including in dynamic data settings, which makes it practical for query optimization in modern systems.
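The accuracy metric mentioned above, the $Q$-error, is conventionally the larger of the two ratios between the estimated and the true cardinality, so a perfect estimate scores 1. A minimal Python sketch follows; the function name `q_error` and the clamping floor of 1 (used to avoid division by zero) are illustrative assumptions and are not taken from the paper.

```python
def q_error(estimate: float, truth: float, floor: float = 1.0) -> float:
    """Return max(est/true, true/est) after clamping both values to `floor`.

    The clamp is an assumption made here so that empty results
    (cardinality 0) do not cause a division by zero.
    """
    est = max(estimate, floor)
    tru = max(truth, floor)
    return max(est / tru, tru / est)


if __name__ == "__main__":
    # Under- and over-estimating by the same factor give the same Q-error.
    print(q_error(50, 100))   # underestimate by a factor of 2
    print(q_error(200, 100))  # overestimate by a factor of 2
```

Because the metric is a ratio, it penalizes under- and overestimation symmetrically, which is why it is usually reported with its median and high percentiles rather than as a plain average error.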

Abstract

Cardinality estimation is a fundamental functionality in database systems. Most existing cardinality estimators focus on handling predicates over numeric or categorical data. They have largely omitted an important data type, set-valued data, which frequently occur in contemporary applications such as information retrieval and recommender systems. The few existing estimators for such data either favor high-frequency elements or rely on a partial independence assumption, which limits their practical applicability. We propose ACE, an Attention-based Cardinality Estimator for estimating the cardinality of queries over set-valued data. We first design a distillation-based data encoder to condense the dataset into a compact matrix. We then design an attention-based query analyzer to capture correlations among query elements. To handle variable-sized queries, a pooling module is introduced, followed by a regression model (MLP) to generate final cardinality estimates. We evaluate ACE on three datasets with varying query element distributions, demonstrating that ACE outperforms the state-of-the-art competitors in terms of both accuracy and efficiency.
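To illustrate how a pooling step can turn a variable number of query-element vectors into one fixed-size vector, the sketch below implements a generic scaled dot-product attention pooling in plain Python. It is not the paper's module: the names `attention_pool` and `seed` are hypothetical, and the seed vector, which would be a learned parameter in a trained model, is fixed by hand here.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a non-empty sequence of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(element_vectors: Sequence[Vector], seed: Vector) -> Vector:
    """Pool any number of element vectors into one vector of len(seed).

    Each element is scored by its scaled dot product with `seed`
    (hypothetical stand-in for a learned parameter); the output is the
    softmax-weighted sum of the element vectors.
    """
    if not element_vectors:
        raise ValueError("a query must contain at least one element")
    dim = len(seed)
    scale = math.sqrt(dim)
    scores = [
        sum(s * x for s, x in zip(seed, vec)) / scale
        for vec in element_vectors
    ]
    weights = softmax(scores)
    return [
        sum(w * vec[i] for w, vec in zip(weights, element_vectors))
        for i in range(dim)
    ]


if __name__ == "__main__":
    seed = [0.5, 0.5]
    two_elements = [[1.0, 0.0], [0.0, 1.0]]
    three_elements = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # Both queries yield a vector of the same length despite different sizes.
    print(len(attention_pool(two_elements, seed)))
    print(len(attention_pool(three_elements, seed)))
```

The output length depends only on the embedding dimension, not on how many elements the query contains, so a downstream regression model such as an MLP can consume it directly.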

Paper Structure

This paper contains 28 sections, 11 equations, 11 figures, 16 tables.

Figures (11)

  • Figure 1: Overview of ACE.
  • Figure 2: Graph construction approaches.
  • Figure 3: Distillation model.
  • Figure 4: Hybrid attention framework.
  • Figure 5: Attention pooling.
  • ...and 6 more figures

Theorems & Definitions (1)

  • Definition 1: Set-Valued Query