Learned Cardinalities: Estimating Correlated Joins with Deep Learning

Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, Alfons Kemper

TL;DR

This work tackles cardinality estimation for query optimization by introducing MSCN, a multi-set convolutional network that represents queries as sets of tables, joins, and predicates. By applying per-element MLPs and averaging within each set, MSCN achieves a permutation-invariant, compact model that can incorporate materialized sample bitmaps to learn join-crossing correlations and handle 0-tuple scenarios. The approach is trained on synthetically generated queries with labels derived from actual data and augmented with sampling signals, and evaluated on the IMDb dataset where it competes with and often surpasses state-of-the-art sampling methods while using far less data. The results demonstrate robustness to challenging cases and highlight promising directions for extending the model to more complex predicates, uncertainty estimation, and update handling, offering a feasible ML-based alternative to traditional cardinality estimation techniques.
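The set-based architecture summarized above (a shared per-element MLP, averaging within each set, concatenation, and a final output network) can be sketched roughly as follows. This is a minimal NumPy illustration and not the authors' implementation: the feature dimensions, hidden width, random weights, sigmoid output, and the names `init_mlp`, `set_module`, and `mscn_forward` are all assumptions made for the example, and every set is assumed to be non-empty.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(d_in, d_hid, d_out):
    """Random weights for a two-layer MLP (illustrative initialization only)."""
    return {
        "W1": rng.normal(0.0, 0.1, (d_in, d_hid)), "b1": np.zeros(d_hid),
        "W2": rng.normal(0.0, 0.1, (d_hid, d_out)), "b2": np.zeros(d_out),
    }

def relu(x):
    return np.maximum(x, 0.0)

def set_module(elements, p):
    """Apply the same two-layer MLP to every set element, then average.

    `elements` has shape (set_size, d_in); averaging over axis 0 makes the
    result independent of the order of the elements.
    """
    h = relu(elements @ p["W1"] + p["b1"])
    h = relu(h @ p["W2"] + p["b2"])
    return h.mean(axis=0)

def mscn_forward(tables, joins, predicates, params):
    """Pool each set, concatenate the pooled vectors, run the output network."""
    pooled = np.concatenate([
        set_module(tables, params["tables"]),
        set_module(joins, params["joins"]),
        set_module(predicates, params["predicates"]),
    ])
    out = params["out"]
    h = relu(pooled @ out["W1"] + out["b1"])
    z = h @ out["W2"] + out["b2"]
    # Assumption: a sigmoid maps the output to (0, 1), e.g. a normalized cardinality.
    return float(1.0 / (1.0 + np.exp(-z[0])))

# Hypothetical feature sizes: 6 per table, 4 per join, 8 per predicate.
HID = 16
params = {
    "tables": init_mlp(6, HID, HID),
    "joins": init_mlp(4, HID, HID),
    "predicates": init_mlp(8, HID, HID),
    "out": init_mlp(3 * HID, HID, 1),
}

tables = rng.random((3, 6))      # a query touching three tables
joins = rng.random((2, 4))       # with two joins
predicates = rng.random((4, 8))  # and four predicates

estimate = mscn_forward(tables, joins, predicates, params)
# Reordering the predicates should not change the estimate.
shuffled = mscn_forward(tables, joins, predicates[::-1], params)
```

Because each set is reduced by an average, `estimate` and `shuffled` agree up to floating-point rounding, which is the permutation invariance the summary refers to.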

Abstract

We describe a new deep learning approach to cardinality estimation. MSCN is a multi-set convolutional network, tailored to representing relational query plans, that employs set semantics to capture query features and true cardinalities. MSCN builds on sampling-based estimation, addressing its weaknesses when no sampled tuples qualify a predicate, and in capturing join-crossing correlations. Our evaluation of MSCN using a real-world dataset shows that deep learning significantly enhances the quality of cardinality estimation, which is the core problem in query optimization.
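To make the sampling weakness mentioned above concrete, the sketch below evaluates a predicate on a small materialized sample and scales the qualifying fraction to the table size. It is a toy illustration under stated assumptions: the column values, the table size of 1000 rows, and the helper names `sample_bitmap` and `sampling_estimate` are hypothetical, and the way such bitmaps are fed into MSCN is not shown here.

```python
import numpy as np

def sample_bitmap(sample_values, predicate):
    """One bit per sampled tuple: 1 if the tuple satisfies the predicate."""
    return np.array([1 if predicate(v) else 0 for v in sample_values],
                    dtype=np.uint8)

def sampling_estimate(bitmap, table_rows):
    """Plain sampling estimate: qualifying fraction scaled to the table size."""
    return float(bitmap.mean() * table_rows)

# Hypothetical materialized sample of a year column from a 1000-row table.
sample_years = [1995, 2001, 2010, 1987]
table_rows = 1000

bm_common = sample_bitmap(sample_years, lambda y: y > 2000)
bm_rare = sample_bitmap(sample_years, lambda y: y > 2020)

est_common = sampling_estimate(bm_common, table_rows)
est_rare = sampling_estimate(bm_rare, table_rows)
```

In the second case no sampled tuple qualifies, so the bitmap is all zeros and the plain sampling estimate collapses to 0 even though matching rows may exist in the full table. This is the situation the abstract describes as a weakness of sampling-based estimation.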

Paper Structure

This paper contains 26 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Architecture of our multi-set convolutional network. Tables, joins, and predicates are represented as separate modules, comprised of one two-layer neural network per set element with shared parameters. Module outputs are averaged, concatenated, and fed into a final output network.
  • Figure 2: Query featurization as sets of feature vectors.
  • Figure 3: Estimation errors on the synthetic workload. The box boundaries are at the 25th/75th percentiles and the horizontal "whisker" lines mark the 95th percentiles.
  • Figure 4: Estimation errors on the synthetic workload with different model variants.
  • Figure 5: Estimation errors on the scale workload showing how MSCN generalizes to queries with more joins.
  • ...and 1 more figure