Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries

Mike Heddes, Igor Nunes, Tony Givargis, Alex Nicolau

TL;DR

The paper addresses fast, accurate cardinality estimation for multi-join queries in streaming settings. It introduces a sketch that merges Count sketches with circular convolution instead of the Hadamard product and uses circular cross-correlation for inference, which enables constant-time per-tuple updates and efficient FFT-based inference. The authors prove that the estimator is unbiased and give a variance-based error bound that matches the guarantees of the AMS-based method. Experiments show orders-of-magnitude faster updates, with accuracy equal to or better than that of existing sketching baselines and learning-based estimators on real benchmarks. When used for query optimization in PostgreSQL, the estimates yield substantial runtime reductions.

Abstract

With the increasing rate of data generated by critical systems, estimating functions on streaming data has become essential. This demand has driven numerous advancements in algorithms designed to efficiently query and analyze one or more data streams while operating under memory constraints. The primary challenge arises from the rapid influx of new items, requiring algorithms that enable efficient incremental processing of streams in order to keep up. A prominent algorithm in this domain is the AMS sketch. Originally developed to estimate the second frequency moment of a data stream, it can also estimate the cardinality of the equi-join between two relations. Two important advancements have followed: the Count sketch, which significantly improves the sketch update time, and an extension of the AMS sketch that accommodates multi-join queries. However, combining the strengths of these methods to maintain sketches for multi-join queries while ensuring fast update times is a non-trivial task, and has remained an open problem for decades, as highlighted in the existing literature. In this work, we address this problem by introducing a novel sketching method which has fast updates, even for sketches capable of accurately estimating the cardinality of complex multi-join queries. We prove that our estimator is unbiased and has the same error guarantees as the AMS-based method. Our experimental results confirm the significant improvement in update time complexity, resulting in orders of magnitude faster estimates, with equal or better estimation accuracy.
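
The update-time gap between the AMS sketch and the Count sketch can be seen in a small Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`make_hashes`, `ams_update`, `count_sketch_update`, `estimate_join_size`) are hypothetical, and fully random NumPy tables stand in for the 4-wise independent hash functions that the analysis assumes.

```python
import numpy as np

def make_hashes(n, m, rng):
    """Draw a bin hash h: [n] -> [m] and a sign hash s: [n] -> {-1, +1}.

    Assumption: fully random lookup tables replace the 4-wise independent
    hash families required by the formal guarantees.
    """
    bins = rng.integers(0, m, size=n)
    signs = rng.choice(np.array([-1, 1]), size=n)
    return bins, signs

def ams_update(sketch, sign_matrix, item, count=1):
    """AMS-style update: all m counters change, so the cost is O(m) per tuple."""
    sketch += count * sign_matrix[:, item]

def count_sketch_update(sketch, bins, signs, item, count=1):
    """Count sketch update: a single counter changes, so the cost is O(1) per tuple."""
    sketch[bins[item]] += count * signs[item]

def estimate_join_size(sketch_r, sketch_s):
    """Equi-join size estimate: inner product of two Count sketches built
    with the same bin and sign hashes."""
    return float(sketch_r @ sketch_s)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 1000, 4096  # domain size and sketch width (illustrative values)
    bins, signs = make_hashes(n, m, rng)

    # Join-attribute values of two synthetic relations R and S.
    r_keys = rng.integers(0, 100, size=500)
    s_keys = rng.integers(0, 100, size=500)

    sketch_r, sketch_s = np.zeros(m), np.zeros(m)
    for key in r_keys:
        count_sketch_update(sketch_r, bins, signs, key)
    for key in s_keys:
        count_sketch_update(sketch_s, bins, signs, key)

    exact = int(np.bincount(r_keys, minlength=n) @ np.bincount(s_keys, minlength=n))
    print("exact join size:", exact)
    print("estimated join size:", estimate_join_size(sketch_r, sketch_s))
```

The two update functions differ only in how many counters they touch, which is the source of the speed difference. The estimate is only unbiased in expectation over the hash functions, so a single run can land above or below the exact value.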

Paper Structure (25 sections, 4 theorems, 10 equations, 9 figures, 6 tables, 3 algorithms)

Key Result

Theorem 2.1

For any vectors $\bm{f}, \bm{g} \in \mathbb{R}^n$ and a random matrix $\bm{\Pi} \in \mathbb{R}^{m \times n}$ constructed from 4-wise independent hash functions $s_j\colon [n] \to \{-1, +1\}$ for $j \in [m]$, with $\Pi_{j,i} = s_j(i)$, we have:

Figures (9)

  • Figure 1: Streaming query-processing scheme
  • Figure 2: Comparison of the AMS and Count sketches performing a sketch update for an item in the stream.
  • Figure 3: Example join graph and corresponding SQL query. Additional attributes in each relation, not involved in the join, are omitted for clarity.
  • Figure 4: Comparison of the Hadamard product (left) and circular convolution (right) on two single-item Count Sketches. The resulting sketch represents the 2-tuple $(a, b)$.
  • Figure 5: Total number of entries across all columns of both the STATS and IMDB databases, grouped by the best fit Zipf parameter of each column. Synthetic entries refer to the id and md5sum columns, which are unique by design, the real entries include all other columns.
  • ...and 4 more figures
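
The contrast described in the caption of Figure 4 can be reproduced with a few lines of Python. This is a minimal sketch with made-up values: the bin indices 3 and 6, the signs, the width `m = 8`, and the helper names are all hypothetical, chosen only to show what each merge operation does to two single-item Count sketches.

```python
import numpy as np

def single_item_sketch(bin_index, sign, m):
    """Count sketch of a single item: one counter holds the item's sign."""
    sketch = np.zeros(m)
    sketch[bin_index] = sign
    return sketch

def circular_convolution(x, y):
    """Direct O(m^2) circular convolution: out[(i + j) mod m] += x[i] * y[j]."""
    m = len(x)
    out = np.zeros(m)
    for i in range(m):
        for j in range(m):
            out[(i + j) % m] += x[i] * y[j]
    return out

m = 8
# Hypothetical hash values for item a (first attribute) and item b (second attribute).
sketch_a = single_item_sketch(bin_index=3, sign=+1, m=m)
sketch_b = single_item_sketch(bin_index=6, sign=-1, m=m)

# Hadamard product: non-zero only if both items happen to fall in the same bin.
hadamard = sketch_a * sketch_b

# Circular convolution: always a single-item sketch again, located at
# (3 + 6) mod 8 = 1 with sign (+1) * (-1) = -1.
convolved = circular_convolution(sketch_a, sketch_b)

print("Hadamard product:    ", hadamard)
print("Circular convolution:", convolved)
```

With these values the Hadamard product is all zeros because the two items land in different bins, while the convolution keeps exactly one non-zero counter that can represent the 2-tuple $(a, b)$.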

Theorems & Definitions (8)

  • Definition 2.1: $k$-wise independence [wegman1981new, pagh2013compressed]
  • Theorem 2.1: AMS sketch
  • Theorem 2.2: Count sketch
  • Definition 2.2: Circular convolution
  • Definition 2.3: Tensor sketch [pagh2013compressed, pham2013fast]
  • Theorem 2.3
  • Definition 3.1: Circular cross-correlation
  • Theorem 3.1
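
The two operations named in Definitions 2.2 and 3.1 can both be computed with the FFT. The Python sketch below checks the FFT forms against direct sums. It assumes the common conventions $(x * y)[k] = \sum_i x[i]\, y[(k - i) \bmod m]$ for circular convolution and $(x \star y)[k] = \sum_i x[i]\, y[(i + k) \bmod m]$ for circular cross-correlation; the paper's exact indexing is not reproduced in this summary and may differ, and the function names are hypothetical.

```python
import numpy as np

def circular_convolution_direct(x, y):
    """(x * y)[k] = sum_i x[i] * y[(k - i) mod m], computed in O(m^2)."""
    m = len(x)
    return np.array([sum(x[i] * y[(k - i) % m] for i in range(m)) for k in range(m)])

def circular_cross_correlation_direct(x, y):
    """(x ⋆ y)[k] = sum_i x[i] * y[(i + k) mod m], computed in O(m^2)."""
    m = len(x)
    return np.array([sum(x[i] * y[(i + k) % m] for i in range(m)) for k in range(m)])

def circular_convolution_fft(x, y):
    """Same result in O(m log m): element-wise product in the frequency domain."""
    return np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(y), n=len(x))

def circular_cross_correlation_fft(x, y):
    """Same result in O(m log m): conjugate the spectrum of the first argument."""
    return np.fft.irfft(np.conj(np.fft.rfft(x)) * np.fft.rfft(y), n=len(x))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, y = rng.normal(size=8), rng.normal(size=8)
    print(np.allclose(circular_convolution_fft(x, y), circular_convolution_direct(x, y)))
    print(np.allclose(circular_cross_correlation_fft(x, y),
                      circular_cross_correlation_direct(x, y)))
```

Under these conventions, cross-correlating a one-hot vector at index $a$ with a one-hot vector at index $(a + b) \bmod m$ returns a one-hot vector at index $b$, so cross-correlation undoes the index shift that convolution introduces.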