AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators

Xianghong Xu, Tieying Zhang, Xiao He, Haoyang Li, Rong Kang, Shuai Wang, Linhui Xu, Zhimin Liang, Shangyu Luo, Lei Zhang, Jianjun Chen

TL;DR

AdaNDV addresses NDV estimation by learning to select and fuse existing estimators rather than directly predicting the ground truth. It splits base estimators into overestimation and underestimation groups, trains ranking-based selectors for each, and fuses the chosen estimators with learned weights in the log domain to produce a final estimate $\hat{D}$. The approach leverages frequency-profile features from samples and optimizes a multi-term objective $\mathcal{L}_{\textsc{AdaNDV}}$ to balance selection and fusion quality, validated on a large TabLib-based dataset with tens of thousands of columns. Results show AdaNDV consistently outperforms traditional, hybrid, and learned estimators, demonstrating robustness to distribution shifts and sampling rates and offering practical benefits for database cardinality estimation.
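The frequency profile mentioned above is, in the usual NDV literature, the vector whose j-th entry counts how many distinct values occur exactly j times in the sample. A minimal sketch of that computation follows; the function name `frequency_profile` and the optional truncation parameter `max_j` are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def frequency_profile(sample, max_j=None):
    """Return a list f where f[j-1] is the number of distinct values
    that appear exactly j times in the sample.

    max_j (hypothetical parameter) truncates the profile to its first
    max_j entries; by default the full profile is returned.
    """
    value_counts = Counter(sample)            # value -> occurrences in the sample
    profile = Counter(value_counts.values())  # occurrences -> number of distinct values
    top = max_j if max_j is not None else max(profile, default=0)
    return [profile.get(j, 0) for j in range(1, top + 1)]


# Toy sample: "a" occurs 3 times, "b" twice, "c" and "d" once each.
sample = ["a", "b", "a", "c", "a", "d", "b"]
f = frequency_profile(sample)

# Two identities that hold for any untruncated profile:
distinct_in_sample = sum(f)                                   # number of distinct sampled values
sample_size = sum(j * f_j for j, f_j in enumerate(f, start=1))  # total sampled rows
```

The untruncated profile recovers both the sample size and the number of distinct sampled values, which is why it is a natural input for estimators that must work from a sample alone.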

Abstract

Estimating the Number of Distinct Values (NDV) is fundamental for numerous data management tasks, especially within database applications. However, most existing works primarily focus on introducing new statistical or learned estimators, while identifying the most suitable estimator for a given scenario remains largely unexplored. Therefore, we propose AdaNDV, a learned method designed to adaptively select and fuse existing estimators to address this issue. Specifically, (1) we propose to use learned models to distinguish between overestimated and underestimated estimators and then select appropriate estimators from each category. This strategy provides a complementary perspective by integrating overestimations and underestimations for error correction, thereby improving the accuracy of NDV estimation. (2) To further integrate the estimation results, we introduce a novel fusion approach that employs a learned model to predict the weights of the selected estimators and then applies a weighted sum to merge them. By combining these strategies, the proposed AdaNDV fundamentally distinguishes itself from previous works that directly estimate NDV. Moreover, extensive experiments conducted on real-world datasets, with the number of individual columns being several orders of magnitude larger than in previous studies, demonstrate the superior performance of our method.
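The select-then-fuse idea can be illustrated with a small sketch. It assumes that selector scores and fusion weights are already available (in AdaNDV these come from learned models, which are not reproduced here) and that the weighted sum is taken over log-estimates and exponentiated at the end. All estimator names, scores, weights, and numbers below are made up for illustration.

```python
import math


def select_top_k(estimates, scores, k):
    """Return the k estimator names with the highest selector scores."""
    ranked = sorted(estimates, key=lambda name: scores[name], reverse=True)
    return ranked[:k]


def fuse_log_domain(estimates, weights):
    """Weighted sum of log-estimates, mapped back with exp.

    With weights that sum to 1 this is a weighted geometric mean.
    """
    log_estimate = sum(w * math.log(estimates[name]) for name, w in weights.items())
    return math.exp(log_estimate)


# Hypothetical outputs of four base estimators for one column.
over = {"est_A": 1200.0, "est_B": 2000.0}    # assumed to overestimate
under = {"est_C": 600.0, "est_D": 300.0}     # assumed to underestimate

# Hypothetical selector scores (higher means "expected to be closer").
over_scores = {"est_A": 0.9, "est_B": 0.1}
under_scores = {"est_C": 0.8, "est_D": 0.2}

# Pick one estimator from each group, then fuse with equal (assumed) weights.
chosen = select_top_k(over, over_scores, 1) + select_top_k(under, under_scores, 1)
weights = {name: 0.5 for name in chosen}
d_hat = fuse_log_domain({**over, **under}, weights)
```

Because one chosen estimate lies above and one below, the fused value falls between them, which is the error-correcting effect the abstract describes.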

Paper Structure

This paper contains 29 sections, 12 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: Evaluation of fourteen statistical estimators on 25,159 test columns, where each bar shows the proportion of columns on which that estimator is optimal (lowest estimation error) among the fourteen. No single estimator is optimal on more than 40% of test cases.
  • Figure 2: Overview of AdaNDV on NDV estimation including training and inference data pipelines.
  • Figure 3: Intuition behind leveraging the properties of overestimation and underestimation.
  • Figure 4: Error distribution of learned estimators on the test set. The violin plot is in blue. The boxplot is in black, the gray box contains 50% of data points, and the white line in the gray box represents the median. We exclude the SO estimator due to its extremely large mean error.
  • Figure 5: Performance of AdaNDV under different hyperparameters, measured by the mean and 99th percentile of q-error.
  • ...and 3 more figures
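
The q-error named in the Figure 5 caption is commonly defined as the larger of estimate/truth and truth/estimate, so it is always at least 1 and penalizes over- and underestimation symmetrically. The sketch below assumes that definition; the nearest-rank percentile helper is an illustrative choice and not necessarily the one used in the paper.

```python
import math


def q_error(estimate, truth):
    """Return max(estimate / truth, truth / estimate); both inputs must be positive."""
    if estimate <= 0 or truth <= 0:
        raise ValueError("q-error is defined for positive values only")
    return max(estimate / truth, truth / estimate)


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up (estimate, truth) pairs for illustration.
pairs = [(50, 100), (200, 100), (100, 100), (400, 100)]
errors = [q_error(est, truth) for est, truth in pairs]
mean_error = sum(errors) / len(errors)
p99_error = percentile_nearest_rank(errors, 99)
```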