Efficient Candidate-Free R-S Set Similarity Joins with Filter-and-Verification Trees on MapReduce

Yuhong Feng, Fangcao Jian, Yixuan Cao, Xiaobin Jian, Jia Wang, Haiyue Feng, Chunyan Miao

TL;DR

The paper tackles the scalability bottleneck of exact R-S set similarity joins by eliminating candidate generation through a candidate-free framework built on the filter-and-verification tree (FVT) and its linear variant (LFVT). It introduces single-stage CF-RS-Join algorithms and distributed MR-CF-RS-Join variants that perform filtering and verification together in memory, augmented by load-aware data partitioning and dynamic-programming-based partitioning. Across seven real-world datasets, the approach outperforms state-of-the-art baselines in execution time while also reducing I/O, memory, and disk usage. The result is a practical MapReduce solution for exact R-S joins that scales with both data size and cluster size.

Abstract

Given two different collections of sets R and S, the exact R-S set similarity join (R-S Join) finds all set pairs with similarity no less than a given threshold, a task with widespread applications. Existing algorithms accelerate large-scale R-S Joins using a two-stage filter-and-verification framework along with the parallel and distributed MapReduce framework. However, they suffer from excessive candidate set pairs (candidates), leading to significant I/O and verification overhead. This paper proposes novel candidate-free R-S Join (CF-RS-Join) algorithms that integrate filtering and verification into a single stage through the filter-and-verification tree (FVT) and its linear variant (LFVT). First, CF-RS-Join with FVT (CF-RS-Join/FVT) is proposed to leverage an innovative FVT structure that compresses elements and associated sets in memory, enabling single-stage processing that eliminates candidate generation, enables fast lookups, and reduces database scans. Correctness proofs are provided. Second, CF-RS-Join with LFVT (CF-RS-Join/LFVT) is proposed to exploit a more compact Linear FVT, which compresses non-branching paths into single nodes and stores them in linear arrays for optimized traversal. Third, MR-CF-RS-Join/FVT and MR-CF-RS-Join/LFVT are proposed to extend our approaches using MapReduce for parallel processing. Extensive experiments have been conducted on the proposed algorithms against state-of-the-art (SOTA) baselines in terms of execution time, scalability, memory usage, and disk usage. The results show that MR-CF-RS-Join/LFVT outperforms the runner-up by 1.37x to 15.78x on 7 real-world datasets.
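To make the problem statement concrete, the following is a minimal brute-force sketch of the exact R-S Join under Jaccard similarity. It is only a reference specification of the output, not the paper's CF-RS-Join or FVT algorithm; the function names `jaccard` and `naive_rs_join`, the use of integer element IDs, and the convention that two empty sets have similarity 1 are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def jaccard(r: Set[int], s: Set[int]) -> float:
    """Jaccard similarity |r ∩ s| / |r ∪ s| (assumed to be 1.0 for two empty sets)."""
    if not r and not s:
        return 1.0
    return len(r & s) / len(r | s)


def naive_rs_join(R: List[Set[int]], S: List[Set[int]], t: float) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with jaccard(R[i], S[j]) >= t.

    Compares every pair, so it costs O(|R| * |S|) similarity computations.
    Hypothetical baseline for illustration only.
    """
    return [
        (i, j)
        for i, r in enumerate(R)
        for j, s in enumerate(S)
        if jaccard(r, s) >= t
    ]


if __name__ == "__main__":
    R = [{1, 2, 3}, {4, 5}]
    S = [{1, 2, 3, 4}, {4, 5}, {6}]
    print(naive_rs_join(R, S, 0.7))
```

Any exact algorithm, candidate-based or candidate-free, must return the same pairs as this baseline; the approaches differ only in how much work they avoid on the way there.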

Paper Structure

This paper contains 23 sections, 3 theorems, 2 equations, 13 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

Given two sets $r$ and $s$, and a threshold $t$, if $\texttt{sim}(r, s) \ge t$, then $lb_r \le |s| \le ub_r$. When $\texttt{sim} = \text{Jaccard}$, $lb_r = \lceil |r| \times t \rceil$ and $ub_r = \lfloor |r| / t \rfloor$ (cf. the paper's table of similarity coefficients).
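The Jaccard case of Lemma 1 can be sketched as a small helper. This is an illustrative sketch rather than code from the paper: the names `jaccard_length_bounds` and `passes_length_filter` are hypothetical, and the threshold is passed as a `Fraction` (an assumption made here) so that products such as $10 \times 0.7$ are not rounded up by floating-point error.

```python
import math
from fractions import Fraction
from typing import Set, Tuple


def jaccard_length_bounds(r_size: int, t: Fraction) -> Tuple[int, int]:
    """Bounds (lb_r, ub_r) on |s| from Lemma 1 for Jaccard similarity.

    lb_r = ceil(|r| * t), ub_r = floor(|r| / t), with 0 < t <= 1.
    Exact rational arithmetic avoids float artifacts in ceil/floor.
    """
    if not 0 < t <= 1:
        raise ValueError("threshold t must satisfy 0 < t <= 1")
    lb = math.ceil(r_size * t)
    ub = math.floor(r_size / t)
    return lb, ub


def passes_length_filter(r: Set[int], s: Set[int], t: Fraction) -> bool:
    """True if |s| lies in [lb_r, ub_r]; pairs failing this cannot reach similarity t."""
    lb, ub = jaccard_length_bounds(len(r), t)
    return lb <= len(s) <= ub


if __name__ == "__main__":
    t = Fraction(7, 10)
    print(jaccard_length_bounds(10, t))
    print(passes_length_filter({1, 2, 3}, {1, 2, 3, 4}, t))
```

The filter is a necessary condition only: a pair that passes still has to be verified, while a pair that fails can be discarded without computing its similarity.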

Figures (13)

  • Figure 1: Candidate-Based vs. Candidate-Free R-S Joins
  • Figure 2: Set collections $\text{R}$, $\text{S}$, $\text{S}'$ (reorganized from $\text{S}$), and the FVT $\texttt{FVT}_{\text{S}}$ constructed over $\text{S}'$
  • Figure 3: The construction of an LFVT over $\text{S}$
  • Figure 4: MR-CF-RS-Join/FVT over $\text{R}$ and $\text{S}$ $(k=2)$
  • Figure 5: Histogram of elements and set sizes
  • ...and 8 more figures

Theorems & Definitions (11)

  • Definition 1: R-S Set Similarity Join
  • Example 1
  • Example 2
  • Example 3
  • Lemma 1: Length Filter
  • Lemma 2: Correctness of CF-RS-Join/FVT w/o Length Filter
  • proof
  • Theorem 1: Correctness of CF-RS-Join/FVT w/ Length Filter
  • proof
  • Example 4
  • ...and 1 more