Efficient Candidate-Free R-S Set Similarity Joins with Filter-and-Verification Trees on MapReduce
Yuhong Feng, Fangcao Jian, Yixuan Cao, Xiaobin Jian, Jia Wang, Haiyue Feng, Chunyan Miao
TL;DR
The paper tackles the scalability bottleneck of exact R-S set similarity joins by eliminating candidate generation through a candidate-free framework built on filter-and-verification trees (FVT) and its linear variant (LFVT). It introduces single-stage CF-RS-Join algorithms and distributed MR-CF-RS-Join variants that perform filtering and verification concurrently in memory, augmented by load-aware data partitioning and dynamic-programming-based partitioning. The approach yields significant performance gains over state-of-the-art baselines across seven real-world datasets, including substantial reductions in I/O, memory, and disk usage. This work enables exact R-S joins on large datasets and provides a practical MapReduce solution that scales with both data size and cluster size.
Abstract
Given two different collections of sets R and S, the exact R-S set similarity join (R-S Join) finds all set pairs, one set from R and one from S, whose similarity is no less than a given threshold; it has widespread applications. Existing algorithms accelerate large-scale R-S Joins using a two-stage filter-and-verification framework together with the parallel and distributed MapReduce framework. However, they suffer from excessive candidate set pairs (candidates), leading to significant I/O and verification overhead. This paper proposes novel candidate-free R-S Join (CF-RS-Join) algorithms that integrate filtering and verification into a single stage through the filter-and-verification tree (FVT) and its linear variant (LFVT). First, CF-RS-Join with FVT (CF-RS-Join/FVT) leverages an innovative FVT structure that compresses elements and their associated sets in memory, enabling single-stage processing that eliminates candidate generation, supports fast lookups, and reduces database scans. Correctness proofs are provided. Second, CF-RS-Join with LFVT (CF-RS-Join/LFVT) exploits a more compact Linear FVT, which compresses non-branching paths into single nodes and stores them in linear arrays for optimized traversal. Third, MR-CF-RS-Join/FVT and MR-CF-RS-Join/LFVT extend these approaches to MapReduce for parallel processing. Extensive experiments compare the proposed algorithms against state-of-the-art (SOTA) baselines in terms of execution time, scalability, memory usage, and disk usage. The results show that MR-CF-RS-Join/LFVT outperforms the runner-up by 1.37x to 15.78x on 7 real-world datasets.
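To make the problem statement concrete, the following is a minimal sketch of what an exact R-S Join computes. It is not the FVT or LFVT structure proposed in the paper. The abstract does not name a similarity function, so Jaccard similarity is assumed here, and all names (`jaccard`, `naive_rs_join`, `counting_rs_join`) are hypothetical. The first function is the brute-force definition; the second uses a plain inverted index over S to count exact overlaps in one pass, which only loosely illustrates the idea of deciding each pair without a separate candidate list.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical representation: set id -> set of elements.
SetCollection = Dict[str, FrozenSet[str]]
JoinResult = List[Tuple[str, str, float]]


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed similarity function)."""
    union_size = len(a | b)
    if union_size == 0:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / union_size


def naive_rs_join(R: SetCollection, S: SetCollection, threshold: float) -> JoinResult:
    """Brute-force definition: compare every set in R with every set in S."""
    results = []
    for rid, r in R.items():
        for sid, s in S.items():
            sim = jaccard(r, s)
            if sim >= threshold:
                results.append((rid, sid, sim))
    return sorted(results)


def counting_rs_join(R: SetCollection, S: SetCollection, threshold: float) -> JoinResult:
    """Inverted-index variant; assumes threshold > 0.

    Exact overlaps are accumulated while scanning the index, so each pair
    is accepted or rejected directly from its overlap count.
    """
    # Inverted index: element -> ids of the sets in S that contain it.
    index: Dict[str, List[str]] = defaultdict(list)
    for sid, s in S.items():
        for element in s:
            index[element].append(sid)

    results = []
    for rid, r in R.items():
        overlap: Dict[str, int] = defaultdict(int)
        for element in r:
            for sid in index.get(element, ()):
                overlap[sid] += 1  # exact |r & s| for every s sharing an element
        for sid, inter in overlap.items():
            union_size = len(r) + len(S[sid]) - inter
            sim = inter / union_size
            if sim >= threshold:
                results.append((rid, sid, sim))
    return sorted(results)


if __name__ == "__main__":
    R = {"r1": frozenset("abc"), "r2": frozenset("xy")}
    S = {"s1": frozenset("abcd"), "s2": frozenset("bc"), "s3": frozenset("yz")}
    for row in counting_rs_join(R, S, threshold=0.6):
        print(row)
```

Both functions return the same pairs for any positive threshold, since a pair with no shared element has similarity 0. The brute-force version performs |R| x |S| comparisons, which is the cost that filtering techniques aim to avoid.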
