Table of Contents
Fetching ...

HoneyComb: A Parallel Worst-Case Optimal Join on Multicores

Jiacheng Wu, Dan Suciu

TL;DR

HoneyComb tackles scalable parallel evaluation of Worst-Case Optimal Joins on large shared-memory multicore systems by fully partitioning all query variables and replacing hash-based indexing with a sort-based CoCo structure. It couples HyperCube-inspired partitioning with an eager preprocessing pipeline, a cost-model guided optimizer, and a rewriting mechanism to reduce redundant work, enabling near-linear scaling to many cores. The Preprocessor builds CoCo indices for each partition, while the Executor runs a fully parallel Generic Join over partitioned data with minimal synchronization. Empirical results show HoneyComb outperforms state-of-the-art WCOJ systems on diverse graph and tensor workloads, achieving substantial speedups on skewed and dense data, albeit with preprocessing overhead and some dataset-specific tradeoffs.

Abstract

To achieve true scalability on massive datasets, a modern query engine needs to be able to take advantage of large, shared-memory, multicore systems. Binary joins are conceptually easy to parallelize on a multicore system; however, several applications require a different approach to query evaluation, using a Worst-Case Optimal Join (WCOJ) algorithm. WCOJ is known to outperform traditional query plans for cyclic queries. However, there is no obvious adaptation of WCOJ to parallel architectures. The few existing systems that parallelize WCOJ do this by partitioning only the top variable of the WCOJ algorithm. This leads to work skew (since some relations end up being read entirely by every thread), possible contention between threads (when the hierarchical trie index is built lazily, which is the case on most recent WCOJ systems), and exacerbates the redundant computations already existing in WCOJ. We introduce HoneyComb, a parallel version of WCOJ, optimized for large multicore, shared-memory systems. HoneyComb partitions the domains of all query variables, not just that of the top loop. We adapt the partitioning idea from the HyperCube algorithm, developed by the theory community for computing multi-join queries on a massively parallel shared-nothing architecture, and introduce new methods for computing the shares, optimized for a shared-memory architecture. To avoid the contention created by the lazy construction of the trie-index, we introduce CoCo, a new and very simple index structure, which we build eagerly, by sorting the entire relation. Finally, in order to remove some of the redundant computations of WCOJ, we introduce a rewriting technique of the WCOJ plan that factors out some of these redundant computations. Our experimental evaluation compares HoneyComb with several recent implementations of WCOJ.

HoneyComb: A Parallel Worst-Case Optimal Join on Multicores

TL;DR

HoneyComb tackles scalable parallel evaluation of Worst-Case Optimal Joins on large shared-memory multicore systems by fully partitioning all query variables and replacing hash-based indexing with a sort-based CoCo structure. It couples HyperCube-inspired partitioning with an eager preprocessing pipeline, a cost-model guided optimizer, and a rewriting mechanism to reduce redundant work, enabling near-linear scaling to many cores. The Preprocessor builds CoCo indices for each partition, while the Executor runs a fully parallel Generic Join over partitioned data with minimal synchronization. Empirical results show HoneyComb outperforms state-of-the-art WCOJ systems on diverse graph and tensor workloads, achieving substantial speedups on skewed and dense data, albeit with preprocessing overhead and some dataset-specific tradeoffs.

Abstract

To achieve true scalability on massive datasets, a modern query engine needs to be able to take advantage of large, shared-memory, multicore systems. Binary joins are conceptually easy to parallelize on a multicore system; however, several applications require a different approach to query evaluation, using a Worst-Case Optimal Join (WCOJ) algorithm. WCOJ is known to outperform traditional query plans for cyclic queries. However, there is no obvious adaptation of WCOJ to parallel architectures. The few existing systems that parallelize WCOJ do this by partitioning only the top variable of the WCOJ algorithm. This leads to work skew (since some relations end up being read entirely by every thread), possible contention between threads (when the hierarchical trie index is built lazily, which is the case on most recent WCOJ systems), and exacerbates the redundant computations already existing in WCOJ. We introduce HoneyComb, a parallel version of WCOJ, optimized for large multicore, shared-memory systems. HoneyComb partitions the domains of all query variables, not just that of the top loop. We adapt the partitioning idea from the HyperCube algorithm, developed by the theory community for computing multi-join queries on a massively parallel shared-nothing architecture, and introduce new methods for computing the shares, optimized for a shared-memory architecture. To avoid the contention created by the lazy construction of the trie-index, we introduce CoCo, a new and very simple index structure, which we build eagerly, by sorting the entire relation. Finally, in order to remove some of the redundant computations of WCOJ, we introduce a rewriting technique of the WCOJ plan that factors out some of these redundant computations. Our experimental evaluation compares HoneyComb with several recent implementations of WCOJ.

Paper Structure

This paper contains 17 sections, 2 theorems, 7 equations, 16 figures, 4 tables.

Key Result

theorem 1

For any fractional edge cover $\bm w$ of $Q_P$, algorithm $P$ runs in time $\tilde{O}(\prod_j |R_j|^{w_j})$.

Figures (16)

  • Figure 1: Comparison of WCOJ Parallelization Algorithms
  • Figure 2: Skewness for Triangle Query on WGPB. Here each of the three lines represents the same number of partitions, $4096$, but allocated differently to the three attributes $X$, $Y$, and $Z$. The expression $4096 \times 1 \times 1$ (the blue line) means that the attribute $X$ was partitioned into $4096$ buckets and assigned to the threads, while the attributes $Y$, $Z$ were not partitioned (or partitioned into $1$ bucket). Similarly, $16 \times 16 \times 16$ (the orange line) means that each of $X$, $Y$, $Z$ was partitioned into $16$ buckets, for a total of $4096$ buckets/threads. Each point in the graph represents the runtime of one of the $4096$ threads, run in isolation, on one core; when executed on all $60$ cores, these threads will be dynamically scheduled on all $60$ cores, using work stealing.
  • Figure 3: (Adapted from umbratechreport) Conflicts in Traditional Parallel WCOJ Algorithm. The blue thread tries to access the sub-trie $h_2(1)$ in $S.Y$, while the red thread is constructing it.
  • Figure 4: The HyperCube Partition
  • Figure 5: The Architecture of HoneyComb
  • ...and 11 more figures

Theorems & Definitions (2)

  • theorem 1
  • lemma 1