Table of Contents
Fetching ...

One Join Order Does Not Fit All: Reducing Intermediate Results with Per-Split Query Plans

Yujun He, Hangdong Zhao, Simon Frisk, Yifei Yang, Kevin Kristensen, Paraschos Koutris, Xiangyao Yu

TL;DR

This work tackles the challenge of expensive intermediate results in cyclic multi-join queries by introducing SplitJoin, a front-end framework that inserts a split operator to partition input relations into light and heavy parts and apply per-part query plans using existing binary join engines. The authors develop a general splitting framework with a split phase and a join phase, prove worst-case optimality under a specific instantiation aligned with the AGM bound, and present practical heuristics (including co-splits, data-driven thresholds, and a split-aware optimizer) to enable real-world use. Experimental evaluation on DuckDB and Umbra with skewed graph datasets shows SplitJoin substantially reduces intermediate sizes and improves runtime, completing more queries than native plans and achieving notable speedups (e.g., DuckDB: 2.1x faster on average with 7.9x smaller intermediates; Umbra: 1.3x faster with 1.2x smaller intermediates). The results demonstrate that a lightweight front-end for partition-based optimization can bridge theory and practice, enabling more robust performance for skewed cyclic joins and guiding future developments in adaptive cost models and integration strategies with existing engines.

Abstract

Minimizing intermediate results is critical for efficient multi-join query processing. Although the seminal Yannakakis algorithm offers strong guarantees for acyclic queries, cyclic queries remain an open challenge. In this paper, we propose SplitJoin, a framework that introduces split as a first-class query operator. By partitioning input tables into heavy and light parts, SplitJoin allows different data partitions to use distinct query plans, with the goal of reducing intermediate sizes using existing binary join engines. We systematically explore the design space for split-based optimizations, including threshold selection, split strategies, and join ordering after splits. Implemented as a front-end to DuckDB and Umbra, SplitJoin achieves substantial improvements: on DuckDB, SplitJoin completes 43 social network queries (vs. 29 natively), achieving 2.1x faster runtime and 7.9x smaller intermediates on average (up to 13.6x and 74x, respectively); on Umbra, it completes 45 queries (vs. 35), achieving 1.3x speedups and 1.2x smaller intermediates on average (up to 6.1x and 2.1x, respectively).

One Join Order Does Not Fit All: Reducing Intermediate Results with Per-Split Query Plans

TL;DR

This work tackles the challenge of expensive intermediate results in cyclic multi-join queries by introducing SplitJoin, a front-end framework that inserts a split operator to partition input relations into light and heavy parts and apply per-part query plans using existing binary join engines. The authors develop a general splitting framework with a split phase and a join phase, prove worst-case optimality under a specific instantiation aligned with the AGM bound, and present practical heuristics (including co-splits, data-driven thresholds, and a split-aware optimizer) to enable real-world use. Experimental evaluation on DuckDB and Umbra with skewed graph datasets shows SplitJoin substantially reduces intermediate sizes and improves runtime, completing more queries than native plans and achieving notable speedups (e.g., DuckDB: 2.1x faster on average with 7.9x smaller intermediates; Umbra: 1.3x faster with 1.2x smaller intermediates). The results demonstrate that a lightweight front-end for partition-based optimization can bridge theory and practice, enabling more robust performance for skewed cyclic joins and guiding future developments in adaptive cost models and integration strategies with existing engines.

Abstract

Minimizing intermediate results is critical for efficient multi-join query processing. Although the seminal Yannakakis algorithm offers strong guarantees for acyclic queries, cyclic queries remain an open challenge. In this paper, we propose SplitJoin, a framework that introduces split as a first-class query operator. By partitioning input tables into heavy and light parts, SplitJoin allows different data partitions to use distinct query plans, with the goal of reducing intermediate sizes using existing binary join engines. We systematically explore the design space for split-based optimizations, including threshold selection, split strategies, and join ordering after splits. Implemented as a front-end to DuckDB and Umbra, SplitJoin achieves substantial improvements: on DuckDB, SplitJoin completes 43 social network queries (vs. 29 natively), achieving 2.1x faster runtime and 7.9x smaller intermediates on average (up to 13.6x and 74x, respectively); on Umbra, it completes 45 queries (vs. 35), achieving 1.3x speedups and 1.2x smaller intermediates on average (up to 6.1x and 2.1x, respectively).

Paper Structure

This paper contains 32 sections, 7 theorems, 11 equations, 9 figures, 6 tables, 3 algorithms.

Key Result

theorem 1

Let $Q$ be a natural join query with binary relations of at most size $N$. Then, the above instantiation of the splitting framework runs in time at most $\mathcal{AGM}(Q) = O(N^{\rho})$.

Figures (9)

  • Figure 1: Motivating Example on How Split Avoids Producing Intermediate Tuples That Do Not Contribute to the Final Result.
  • Figure 2: Overview of the Split Planner.
  • Figure 3: Example query graph and initial intermediate relations
  • Figure 4: Split Example on Q5, L/H indicates a join is Light/Heavy.
  • Figure 5: Degree distributions of six datasets. Nodes are ordered by descending degree, with the subject attribute and the object attribute distributions shown on a log–log scale.
  • ...and 4 more figures

Theorems & Definitions (9)

  • theorem 1
  • theorem 2
  • definition 1: Fractional Vertex Packing
  • theorem 3: AGM Bound
  • definition 2
  • lemma 1
  • lemma 2
  • lemma 3
  • theorem 4