Neural Locality Sensitive Hashing for Entity Blocking

Runhui Wang, Luyang Kong, Yefan Tao, Andrew Borthwick, Davor Golac, Henrik Johnson, Shadie Hijazi, Dong Deng, Yongfeng Zhang

TL;DR

This work addresses the blocking stage of entity resolution (ER) by learning LSH-like hash functions for task-specific similarity rules. It introduces NLSHBlock (Neural-LSH Block), which fine-tunes a RoBERTa-based encoder with a novel LSH-aligned loss to map items into a space where similar items are close and dissimilar items are far apart, enabling efficient kNN-based blocking. Across five real-world ER datasets, NLSHBlock achieves superior blocking F1 scores and reduces candidate set size, often outperforming state-of-the-art blocking baselines. Moreover, the learned embeddings improve semi-supervised entity matching by producing higher-quality pseudo labels, boosting downstream matching performance. The approach offers a general, scalable way to tailor LSH to complex similarity functions and can extend to other domains requiring metric-aware hashing. The loss $\mathcal{L}_{NLSH}$ captures the desired locality-sensitive behavior: $\mathcal{L}_{NLSH} = \max(R, |\mathrm{NLSH}(p)-\mathrm{NLSH}(q)|) - \min(cR, |\mathrm{NLSH}(p)-\mathrm{NLSH}(r)|)$, where $q$ is a match of $p$ and $r$ is a non-match. It guides the network to place true matches within a collision radius $R$ while pushing non-matches beyond $cR$.
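The loss in the TL;DR can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that $|\cdot|$ denotes Euclidean distance between embeddings, and the function names `euclidean` and `nlsh_loss` as well as the default values `R=1.0` and `c=2.0` are placeholders chosen for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nlsh_loss(h_p, h_q, h_r, R=1.0, c=2.0):
    """Triplet loss as written in the TL;DR.

    h_p: embedding of the anchor item p
    h_q: embedding of an item q that matches p
    h_r: embedding of an item r that does not match p
    R, c: collision radius and separation factor (illustrative defaults)
    """
    d_pos = euclidean(h_p, h_q)  # distance to the match
    d_neg = euclidean(h_p, h_r)  # distance to the non-match
    # The first term stops decreasing once the match is within R;
    # the second term stops growing once the non-match is beyond cR.
    return max(R, d_pos) - min(c * R, d_neg)


# A well-placed triplet: match inside R, non-match beyond cR.
good = nlsh_loss([0.0, 0.0], [0.5, 0.0], [3.0, 0.0])
# A badly placed triplet: match far away, non-match nearby.
bad = nlsh_loss([0.0, 0.0], [3.0, 0.0], [0.5, 0.0])
```

Under these assumptions the loss bottoms out at $R - cR$ once the match lies within $R$ and the non-match lies beyond $cR$, so triplets that are already well separated yield no further gradient signal.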

Abstract

Locality-sensitive hashing (LSH) is a fundamental algorithmic technique widely employed in large-scale data processing applications, such as nearest-neighbor search, entity resolution, and clustering. However, its applicability in some real-world scenarios is limited due to the need for careful design of hashing functions that align with specific metrics. Existing LSH-based Entity Blocking solutions primarily rely on generic similarity metrics such as Jaccard similarity, whereas practical use cases often demand complex and customized similarity rules surpassing the capabilities of generic similarity metrics. Consequently, designing LSH functions for these customized similarity rules presents considerable challenges. In this research, we propose a neuralization approach to enhance locality-sensitive hashing by training deep neural networks to serve as hashing functions for complex metrics. We assess the effectiveness of this approach within the context of the entity resolution problem, which frequently involves the use of task-specific metrics in real-world applications. Specifically, we introduce NLSHBlock (Neural-LSH Block), a novel blocking methodology that leverages pre-trained language models, fine-tuned with a novel LSH-based loss function. Through extensive evaluations conducted on a diverse range of real-world datasets, we demonstrate the superiority of NLSHBlock over existing methods, exhibiting significant performance improvements. Furthermore, we showcase the efficacy of NLSHBlock in enhancing the performance of the entity matching phase, particularly within the semi-supervised setting.

Paper Structure (22 sections, 10 figures, 5 tables)

Figures (10)

  • Figure 1: An Example of Customized Similarity Metric
  • Figure 2: Entity Resolution: determine the matching entries from two datasets.
  • Figure 3: An example for serialization of items
  • Figure 4: Architecture of Neural-LSH. The input tables are first serialized to text sequences. Training involves generating augmented sequences and randomly sampling negative examples. After training with the loss function $\mathcal{L}_{LSH}$, the model $LM$ generates embeddings that are used to find candidate pairs via kNN search.
  • Figure 5: Visualization of ideally hashed items
  • ...and 5 more figures
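As a rough illustration of the final step described in the Figure 4 caption, the sketch below generates candidate pairs by kNN search over embeddings. It is an assumption-laden toy: the embeddings are hard-coded lists rather than outputs of the trained model, the search is brute force rather than an approximate index, and the function name `knn_candidate_pairs` and the parameter `k` are hypothetical.

```python
import heapq
import math


def knn_candidate_pairs(emb_a, emb_b, k=2):
    """For each embedding in emb_a, pair it with its k nearest embeddings in emb_b.

    emb_a, emb_b: lists of equal-length float vectors, one per record
    Returns a list of (index_in_a, index_in_b) candidate pairs.
    """
    pairs = []
    for i, u in enumerate(emb_a):
        # Brute-force search: rank all records in emb_b by Euclidean distance to u.
        nearest = heapq.nsmallest(
            k, range(len(emb_b)), key=lambda j: math.dist(u, emb_b[j])
        )
        pairs.extend((i, j) for j in nearest)
    return pairs


# Toy embeddings standing in for model outputs on two tables.
table_a = [[0.0, 0.0], [10.0, 10.0]]
table_b = [[0.0, 1.0], [9.0, 10.0], [5.0, 5.0]]
candidates = knn_candidate_pairs(table_a, table_b, k=1)
```

Only the returned pairs would be passed on to the matching stage, which is how blocking shrinks the candidate set compared with comparing every record in one table against every record in the other.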