Optimizing Queries with Many-to-Many Joins

Hasara Kalumin, Amol Deshpande

TL;DR

This work tackles the problem of optimizing queries that feature many-to-many joins, a regime where intermediate results can explode and traditional plans underperform. It introduces a cost model that accounts for postponed intermediate results and redundant probes, and develops optimization algorithms for left-deep plans under factorized (COM), semijoin (SJ), and bitvector-based pruning (BVP) strategies. Through a vectorized prototype and extensive experiments on synthetic and real benchmarks, the authors show that factorized representations coupled with robust cost modeling can dramatically reduce probe counts and execution time, while also reducing sensitivity to join-order estimation errors. The results suggest significant practical impact for graph- and pattern-centric workloads, offering guidance on when to use COM, SJ, and BVP in combination and highlighting robustness improvements over classic rank-ordering approaches.

Abstract

As database query processing techniques are being used to handle diverse workloads, a key emerging challenge is how to efficiently handle multi-way join queries containing multiple many-to-many joins. While uncommon in traditional enterprise settings that have been the focus of much of the query optimization work to date, such queries are seen frequently in other contexts such as graph workloads. This has led to much work on developing join algorithms for handling cyclic queries, on compressed (factorized) representations for more efficient storage of intermediate results, and on the use of semi-joins or predicate transfer to avoid generating large redundant intermediate results. In this paper, we address a core query optimization problem in this context. Specifically, we introduce an improved cost model that more accurately captures the cost of a query plan in such scenarios, and we present several optimization algorithms that incorporate these new cost functions. We present an extensive experimental evaluation that compares the factorized representation approach with a full semi-join reduction approach, as well as with an approach that uses bitvectors to eliminate tuples early through sideways information passing. We also present new analyses of the robustness of these techniques to the choice of the join order, potentially eliminating the need for more complex query optimization and selectivity estimation techniques.
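To make the abstract's point about redundant intermediate results concrete, here is a minimal Python sketch. It is not taken from the paper: the relation contents, the chain query shape, and the helper names (`semijoin`, `hash_join`, `compare_plans`) are all assumptions chosen for illustration. It compares a plain left-to-right join plan against the same plan run after two semi-join passes over a three-relation chain.

```python
from collections import defaultdict


def semijoin(left, right, left_col, right_col):
    """Keep only the tuples of `left` that match at least one tuple of `right`."""
    keys = {t[right_col] for t in right}
    return [t for t in left if t[left_col] in keys]


def hash_join(left, right, left_col, right_col):
    """Hash join; each output tuple is the left tuple concatenated with the right one."""
    index = defaultdict(list)
    for t in right:
        index[t[right_col]].append(t)
    return [l + r for l in left for r in index.get(l[left_col], ())]


def compare_plans():
    # Hypothetical chain query R1(a, b) JOIN R2(b, c) JOIN R3(c, d); data is made up.
    r1 = [(a, 0) for a in range(100)]  # every R1 tuple has b = 0
    r2 = [(0, c) for c in range(100)]  # every R2 tuple has b = 0 (many-to-many on b)
    r3 = [(99, "x")]                   # only c = 99 survives the last join

    # Plan A: join left to right with no reduction.
    naive_mid = hash_join(r1, r2, 1, 0)         # tuples are (a, b, b, c)
    naive_out = hash_join(naive_mid, r3, 3, 0)  # join on c

    # Plan B: reduce the relations with two semi-join passes, then join.
    r2_up = semijoin(r2, r3, 1, 0)         # pass 1: R2 semi-join R3
    r1_up = semijoin(r1, r2_up, 1, 0)      # pass 1: R1 semi-join R2
    r2_red = semijoin(r2_up, r1_up, 0, 1)  # pass 2: R2 semi-join R1
    r3_red = semijoin(r3, r2_red, 0, 1)    # pass 2: R3 semi-join R2
    reduced_mid = hash_join(r1_up, r2_red, 1, 0)
    reduced_out = hash_join(reduced_mid, r3_red, 3, 0)

    return {
        "naive_intermediate": len(naive_mid),
        "reduced_intermediate": len(reduced_mid),
        "output_size": len(naive_out),
        "same_output": sorted(naive_out) == sorted(reduced_out),
    }


if __name__ == "__main__":
    print(compare_plans())
```

On this toy input, the unreduced plan materializes 10,000 intermediate tuples and the reduced plan materializes 100, while both produce the same 100 result tuples. The sketch says nothing about the relative cost of the semi-join passes themselves, which is one of the trade-offs the paper's cost model is meant to capture.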

Paper Structure

This paper contains 29 sections, 5 theorems, 17 equations, 16 figures, and 1 algorithm.

Key Result

theorem 1

The cost function developed in the paper does not satisfy the ASI property.
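For context, ASI stands for adjacent sequence interchange. The statement below is the usual textbook form of the property from the join-ordering literature, not a quotation from this paper; the symbols $C$ (cost function), $r$ (rank function), $A$, $B$ (arbitrary sequences), and $U$, $V$ (non-empty sequences) are notation assumed here. A cost function $C$ has the ASI property if there exists a rank function $r$ such that, for all such sequences,

```latex
C(A\,U\,V\,B) \;\le\; C(A\,V\,U\,B)
\quad\Longleftrightarrow\quad
r(U) \;\le\; r(V)
```

When the property holds, ordering by rank yields an optimal sequence, which is what classic rank-ordering algorithms rely on. A cost function without it, as the theorem states, cannot be optimized this way in general.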

Figures (16)

  • Figure 1: (i) An example 6-relation query used as the running example; (ii) its join graph, with edges annotated with the join attributes; (iii) a left-deep query plan and the partial execution of a single $R_1$ tuple.
  • Figure 2: The Yannakakis algorithm uses two semi-join passes to fully reduce the relations; arrows indicate the probe direction (e.g., the first operation in Phase 1 is $R_2 \ltimes R_3$, which reduces $R_2$)
  • Figure 3: Pushing down bitvectors for early pruning
  • Figure 4: Sampling is highly effective at estimating match probabilities and fanouts ($* \rightarrow$ stddev. = 9.44)
  • Figure 5: (i) Match probabilities and fanouts for the running example; (ii) A partially evaluated query plan
  • ...and 11 more figures

Theorems & Definitions (5)

  • theorem 1
  • theorem 2
  • theorem 3
  • theorem 4
  • theorem 5