BlockingPy: approximate nearest neighbours for blocking of records for entity resolution

Tymoteusz Strojny, Maciej Beręsewicz

TL;DR

BlockingPy tackles the challenge of linking records without unique identifiers by introducing schema-agnostic blocking powered by approximate nearest neighbour (ANN) search and graph-based indexing, with support for both CPU and GPU execution. It unifies input formats (text, document-term matrices (DTMs), embeddings) and multiple ANN backends into a single API, yielding blocks that greatly reduce the number of candidate comparisons while maintaining high linkage quality. The paper presents two case studies (record linkage and deduplication) along with evaluation metrics such as Reduction Ratio and confusion-matrix-based scores, demonstrating high accuracy and substantial computational savings. With real-world impact shown in official statistics workflows, BlockingPy offers an extensible framework for accurate, efficient entity resolution across heterogeneous data sources, and points toward privacy-preserving extensions in future work.

Abstract

Entity resolution (probabilistic record linkage, deduplication) is a key step in scientific analysis and data science pipelines involving multiple data sources. The objective of entity resolution is to link records that refer to the same entity (e.g., person, company) but share no common unique identifier. Without identifiers, however, researchers need to specify which records to compare in order to calculate matching probabilities and keep computational complexity manageable. One solution is to block records deterministically on common variables, such as names, dates of birth or sex, or to use phonetic algorithms. However, this approach assumes that these variables are free of errors and completely observed, which is often not the case. To address this challenge, we have developed a Python package, BlockingPy, which performs blocking with modern approximate nearest neighbour search and graph algorithms to reduce the number of comparisons. The package supports both CPU and GPU execution. In this paper, we present the design of the package, its functionalities and two case studies related to official statistics. The presented software will be useful for researchers interested in linking data from various sources.
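
To make the idea of nearest-neighbour blocking concrete, the following is a minimal sketch written for illustration only; it does not use or reproduce the BlockingPy API. It assumes records are plain strings, represents each one by character bigram counts, replaces the approximate search with an exact brute-force cosine search, and forms blocks as connected components of the resulting nearest-neighbour graph. The function names (`shingles`, `cosine`, `block_records`, `reduction_ratio`) and the sample records are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def shingles(text, n=2):
    """Count the character n-grams of a lower-cased string."""
    text = text.lower()
    counts = defaultdict(int)
    for i in range(len(text) - n + 1):
        counts[text[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def block_records(records, n=2):
    """Return one block id per record (requires at least two records).

    Each record is linked to its single most similar neighbour, and the
    connected components of that graph become the blocks.
    """
    vectors = [shingles(r, n) for r in records]
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, vi in enumerate(vectors):
        # Exact brute-force search: an assumption standing in for an ANN index.
        best = max((j for j in range(len(vectors)) if j != i),
                   key=lambda j: cosine(vi, vectors[j]))
        parent[find(i)] = find(best)

    # Relabel component roots as consecutive block ids in order of appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(records))]


def reduction_ratio(blocks):
    """1 - (pairs compared inside blocks) / (all possible pairs), deduplication case."""
    n = len(blocks)
    sizes = defaultdict(int)
    for b in blocks:
        sizes[b] += 1
    within = sum(s * (s - 1) // 2 for s in sizes.values())
    return 1 - within / (n * (n - 1) // 2)


records = ["john smith", "jon smith", "maria kowalska",
           "marya kowalska", "acme ltd", "acme limited"]
blocks = block_records(records)
print(blocks, reduction_ratio(blocks))
```

Because the comparison works on character n-grams of the whole string, a misspelled name still lands next to its correct form, which deterministic blocking on an exact key would miss. The brute-force search here is quadratic in the number of records; avoiding that cost is what an ANN index is for.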

Paper Structure

This paper contains 16 sections, 2 equations, 1 figure, 7 tables.

Figures (1)

  • Figure 1: The Architecture of the BlockingPy package