ACE: A Cardinality Estimator for Set-Valued Queries
Yufan Sheng, Xin Cao, Kaiqi Zhao, Yixiang Fang, Jianzhong Qi, Wenjie Zhang, Christian S. Jensen
TL;DR
ACE introduces an attention-based cardinality estimator for queries over set-valued data, addressing the limitations of frequency-biased and independence-assuming estimators. It combines a distillation-based data encoder that compresses the dataset into a compact matrix with a cross-/self-attention-based query analyzer and an attention-pooled fixed-size representation to predict query cardinalities via a regression head. Two offline training losses, $L_{CE}$ and $L_{MMD}$, enable the encoder to preserve graph-like element-set relations while the analyzer leverages both data and workload information to capture element correlations. Experiments on real datasets show ACE achieves substantially lower $Q$-errors and faster end-to-end runtimes than state-of-the-art baselines, including in dynamic data settings, demonstrating practical value for query optimization in modern systems.
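For readers unfamiliar with the metric, the $Q$-error of an estimate is commonly defined as the larger of the two ratios between the estimated and true cardinality, so a value of 1 is a perfect estimate. The sketch below illustrates that standard definition. The function name `q_error` and the clamping of both values to a floor of 1 (to avoid division by zero) are assumptions made for illustration, not details taken from ACE.

```python
def q_error(estimate: float, truth: float, floor: float = 1.0) -> float:
    """Return max(est/true, true/est), the usual Q-error definition.

    Both values are clamped to `floor` first (an assumed convention)
    so that empty results do not cause a division by zero.
    """
    est = max(estimate, floor)
    tru = max(truth, floor)
    return max(est / tru, tru / est)


# Under- and over-estimation by the same factor are penalized equally.
under = q_error(50, 100)
over = q_error(200, 100)
exact = q_error(100, 100)
```

Because the metric is a ratio, it treats an estimate that is half the true cardinality the same as one that is double it.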
Abstract
Cardinality estimation is a fundamental functionality in database systems. Most existing cardinality estimators focus on handling predicates over numeric or categorical data. They have largely overlooked an important data type, set-valued data, which occurs frequently in contemporary applications such as information retrieval and recommender systems. The few existing estimators for such data either favor high-frequency elements or rely on a partial independence assumption, which limits their practical applicability. We propose ACE, an Attention-based Cardinality Estimator for estimating the cardinality of queries over set-valued data. We first design a distillation-based data encoder to condense the dataset into a compact matrix. We then design an attention-based query analyzer to capture correlations among query elements. To handle variable-sized queries, we introduce a pooling module, followed by a regression model (an MLP) that generates the final cardinality estimates. We evaluate ACE on three datasets with varying query element distributions and show that it outperforms state-of-the-art competitors in terms of both accuracy and efficiency.
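The step from variable-sized queries to a fixed-size input for the regression model can be illustrated with a minimal attention-pooling sketch. This is not the ACE implementation: the use of seed vectors as attention queries, the dimensions, the random weights, and the names `attention_pool` and `mlp_head` are all assumptions chosen to show the general idea, namely that a set of any size is reduced to a representation of constant shape.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(element_vecs: np.ndarray, seeds: np.ndarray) -> np.ndarray:
    """Pool an (n, d) set of element vectors into a fixed (k, d) matrix.

    Each of the k seed vectors (hypothetical learned parameters) attends
    over all n elements, so the output shape does not depend on n.
    """
    d = element_vecs.shape[1]
    scores = seeds @ element_vecs.T / np.sqrt(d)  # (k, n) scaled dot products
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ element_vecs                 # (k, d)


def mlp_head(pooled: np.ndarray, w1, b1, w2, b2) -> float:
    """Two-layer MLP with ReLU that maps the pooled matrix to one number."""
    hidden = np.maximum(pooled.reshape(-1) @ w1 + b1, 0.0)
    return float(hidden @ w2 + b2)


# Illustrative sizes and random (untrained) parameters.
rng = np.random.default_rng(0)
d, k, hidden_dim = 8, 2, 16
seeds = rng.normal(size=(k, d))
w1, b1 = rng.normal(size=(k * d, hidden_dim)), np.zeros(hidden_dim)
w2, b2 = rng.normal(size=hidden_dim), 0.0

small_query = rng.normal(size=(3, d))  # a query with 3 elements
large_query = rng.normal(size=(7, d))  # a query with 7 elements

pooled_small = attention_pool(small_query, seeds)
pooled_large = attention_pool(large_query, seeds)
prediction = mlp_head(pooled_small, w1, b1, w2, b2)
```

Both queries yield a pooled matrix of the same shape, and reordering the elements of a query leaves the pooled result unchanged, which is the property a set-valued input requires.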
