HyperBlocker: Accelerating Rule-based Blocking in Entity Resolution using GPUs

Xiaoke Zhu, Min Xie, Ting Deng, Qi Zhang

TL;DR

HyperBlocker reduces the latency of rule-based blocking in Entity Resolution (ER) with a GPU-accelerated, pipelined architecture that overlaps CPU-side data handling with GPU computation. It combines a data-aware and rule-aware Execution Plan Generator (EPG) with hardware-conscious GPU optimizations and a multi-GPU scheduler to achieve massive parallelism. The core contributions are the execution plans produced by the EPG, a GPU-centric execution model with divergence-mitigating techniques (PSW and task-stealing), and cross-GPU collaboration strategies. All are validated on real-world datasets, where HyperBlocker outperforms CPU-based and GPU-based baselines while preserving comparable accuracy. The reported runtime speedups are often 6–20×, and blocking scales to tens of millions of tuples, which speeds up ER workflows where blocking is the bottleneck and shortens data cleaning and integration pipelines.

Abstract

This paper studies rule-based blocking in Entity Resolution (ER). We propose HyperBlocker, a GPU-accelerated system for blocking in ER. As opposed to previous blocking algorithms and parallel blocking solvers, HyperBlocker employs a pipelined architecture to overlap data transfer and GPU operations. It generates a data-aware and rule-aware execution plan on CPUs, which specifies how rules are evaluated, and develops a number of hardware-aware optimizations to achieve massive parallelism on GPUs. Using real-life datasets, we show that HyperBlocker is at least 6.8x and 9.1x faster than prior CPU-powered distributed systems and GPU-based ER solvers, respectively. Better still, by combining HyperBlocker with the state-of-the-art ER matcher, we can speed up the overall ER process by at least 30% with comparable accuracy.

Paper Structure (13 sections, 6 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: DL-based blocking vs. rule-based blocking
  • Figure 2: Shared-memory vs. shared-nothing architectures
  • Figure 3: A relation $D$ of schema Products, where the dash ("-") denotes a missing value.
  • Figure 4: The pipelined architecture of HyperBlocker
  • Figure 5: Execution tree
  • ...and 3 more figures