HyperBlocker: Accelerating Rule-based Blocking in Entity Resolution using GPUs
Xiaoke Zhu, Min Xie, Ting Deng, Qi Zhang
TL;DR
HyperBlocker tackles the latency of rule-based blocking in Entity Resolution (ER) with a GPU-accelerated, pipelined architecture that overlaps CPU data handling with GPU computation. It combines a data-aware and rule-aware Execution Plan Generator (EPG) with hardware-conscious GPU optimizations and a multi-GPU scheduler to achieve massive parallelism. The core contributions are the EPG, a GPU-centric execution model with divergence-mitigating techniques (PSW and task-stealing), and cross-GPU collaboration strategies. All are validated on real-world datasets, where HyperBlocker outperforms CPU-based and GPU-based baselines while preserving competitive accuracy. Together, these advances yield substantial runtime speedups (often 6–20×) and scale blocking to tens of millions of tuples, which speeds up ER workflows where blocking is the bottleneck and enables faster data cleaning and integration pipelines.
Abstract
This paper studies rule-based blocking in Entity Resolution (ER). We propose HyperBlocker, a GPU-accelerated system for blocking in ER. As opposed to previous blocking algorithms and parallel blocking solvers, HyperBlocker employs a pipelined architecture to overlap data transfer and GPU operations. It generates a data-aware and rule-aware execution plan on CPUs, for specifying how rules are evaluated, and develops a number of hardware-aware optimizations to achieve massive parallelism on GPUs. Using real-life datasets, we show that HyperBlocker is at least 6.8x and 9.1x faster than prior CPU-powered distributed systems and GPU-based ER solvers, respectively. Better still, by combining HyperBlocker with the state-of-the-art ER matcher, we can speed up the overall ER process by at least 30% with comparable accuracy.
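To make the idea of a rule-aware execution plan concrete, the following is a minimal CPU-only sketch in Python, not HyperBlocker's implementation. It assumes a blocking rule is a conjunction of pairwise predicates and that each predicate carries a hand-assigned cost estimate. The predicates `same_zip` and `similar_name`, the cost values, and the helpers `plan` and `block` are all hypothetical names introduced for illustration. The plan simply evaluates cheaper predicates first, so that most non-matching pairs are rejected before the expensive similarity check runs.

```python
from itertools import combinations

# Hypothetical predicates: each compares two records (dicts) and returns a bool.
def same_zip(a, b):
    return a["zip"] == b["zip"]

def similar_name(a, b, threshold=0.5):
    # Jaccard similarity over lowercase word tokens.
    ta = set(a["name"].lower().split())
    tb = set(b["name"].lower().split())
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold

# A rule is a conjunction of (estimated_cost, predicate) pairs.
# The costs are illustrative assumptions, not measured values.
RULE = [(10.0, similar_name), (1.0, same_zip)]

def plan(rule):
    """Order predicates cheapest first so failing pairs short-circuit early."""
    return [pred for _, pred in sorted(rule, key=lambda cp: cp[0])]

def block(records, rule):
    """Return id pairs that satisfy every predicate of the rule."""
    ordered = plan(rule)
    return [
        (a["id"], b["id"])
        for a, b in combinations(records, 2)
        if all(pred(a, b) for pred in ordered)  # all() stops at the first False
    ]

records = [
    {"id": 1, "name": "Acme Corp", "zip": "10001"},
    {"id": 2, "name": "ACME Corp Inc", "zip": "10001"},
    {"id": 3, "name": "Acme Corp", "zip": "94105"},
    {"id": 4, "name": "Globex", "zip": "10001"},
]

candidates = block(records, RULE)
print(candidates)
```

This sketch enumerates all pairs on a single thread and orders predicates by cost alone. The abstract describes a plan that is also data-aware and an evaluation that runs in parallel on GPUs, and the sketch does not attempt either.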
