Transformer-Gather, Fuzzy-Reconsider: A Scalable Hybrid Framework for Entity Resolution

Mohammadreza Sharifi, Danial Ahmadzadeh

TL;DR

The paper tackles scalable entity resolution in noisy enterprise data by proposing TGFR, a hybrid pipeline that couples semantic embeddings from a pre-trained transformer with KNN-based candidate retrieval and deterministic fuzzy verification. It uses a serialized natural-language representation of records, a fine-tuned all-mpnet-base-v2 encoder with MNR-Loss for ground-truth labeling, and a cosine similarity threshold to finalize matches, achieving a retrieval recall near $0.97$ on CPU-based infrastructure. Experimental results show that transformer-based embeddings outperform TF-IDF, that fine-tuning improves recall, and that adding fuzzy string matching yields a $4.6$ percentage point boost in F1-score, demonstrating the value of a hybrid semantic-syntactic approach. The framework provides a practical, deployable solution for enterprise-level data integrity auditing on standard CPU hardware, with favorable scalability properties and robust performance in production environments.
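For reference, MNR-Loss most likely refers to the multiple negatives ranking loss, which is commonly written as below. This is the standard textbook form, not a formula taken from the paper; the symbols $a_i$ (anchor embedding), $p_i$ (its matching positive), $N$ (batch size), and $s$ (a scale factor) are assumptions introduced here for illustration.

$$
\mathcal{L}_{\mathrm{MNR}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(s \cdot \cos(a_i, p_i)\right)}{\sum_{j=1}^{N} \exp\left(s \cdot \cos(a_i, p_j)\right)}
$$

Each anchor treats the other positives in the batch as negatives, so the loss pulls matching pairs together and pushes non-matching pairs apart without explicitly labeled negatives.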

Abstract

Entity resolution plays a significant role in enterprise systems where data integrity must be rigorously maintained. Traditional methods often struggle with noisy data or lack semantic understanding, while modern methods suffer from high computational costs or a heavy reliance on parallel computation. In this study, we introduce a scalable hybrid framework designed to address scalability, noise robustness, and result reliability. We use a pre-trained language model to encode each structured record into a corresponding semantic embedding vector. Then, after retrieving a semantically relevant subset of candidates, we apply a syntactic verification stage that uses fuzzy string matching to refine classification on the unlabeled data. We applied this approach to a real-world entity resolution task: linking a central user management database to numerous shared hosting server records. Compared to other methods, this approach performs strongly in both processing time and robustness, making it a reliable solution for a server-side product. Crucially, this efficiency does not compromise results, as the system maintains a high retrieval recall of approximately 0.97. The scalability of the framework makes it deployable on standard CPU-based infrastructure, offering a practical and effective solution for enterprise-level data integrity auditing.
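The gather-then-reconsider flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the record fields, the sentence template, the thresholds, and all function names are hypothetical, and a bag of character trigrams stands in for the transformer encoder so that the example runs with the standard library alone. The fuzzy stage uses `difflib.SequenceMatcher`, which may differ from the string-matching technique used in the paper.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def serialize(record):
    """Turn a structured record into one sentence (field names are hypothetical)."""
    return (f"user {record['username']} with email {record['email']} "
            f"owns domain {record['domain']}")


def embed(text, n=3):
    """Stand-in for the transformer encoder: a bag of character trigrams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def gather(query, corpus, k=2):
    """Stage 1: retrieve the k corpus records most similar to the query."""
    q_vec = embed(serialize(query))
    scored = [(cosine(q_vec, embed(serialize(rec))), rec) for rec in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def reconsider(query, candidates, threshold=0.9):
    """Stage 2: keep candidates whose fuzzy string ratio clears the threshold."""
    q_text = serialize(query).lower()
    verified = []
    for _, rec in candidates:
        ratio = SequenceMatcher(None, q_text, serialize(rec).lower()).ratio()
        if ratio >= threshold:
            verified.append((ratio, rec))
    return verified


# Toy data: a central user database and one noisy server record (email typo).
central = [
    {"username": "jsmith", "email": "j.smith@example.com", "domain": "smith-shop.com"},
    {"username": "mgarcia", "email": "maria@example.org", "domain": "garcia-law.org"},
    {"username": "jsmithson", "email": "smithson@example.net", "domain": "smithson.net"},
]
server_record = {"username": "jsmith", "email": "j.smith@exmaple.com",
                 "domain": "smith-shop.com"}

matches = reconsider(server_record, gather(server_record, central))
print([rec["username"] for _, rec in matches])
```

The two stages are kept separate on purpose: the retrieval step only has to be cheap and high-recall, while the stricter string comparison runs on the few candidates it returns rather than on the whole database.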

Paper Structure

This paper contains 15 sections, 4 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: An example of serializing a structured user record into a standardized sentence format.
  • Figure 2: Example t-SNE illustration of matching regions formed by 12 sample server queries and their $5$ nearest neighbors. The color intensity reflects the average cosine distance of the territory members to the main server data.
  • Figure 3: Training loss over training steps. The curve shows a smooth training process and a low final loss value.
  • Figure 4: Recall values comparison for different methods. Left to right: KNN, ANN, Query matching, DBSCAN+ANN, K-means+ANN, Lexical Search, and Brute Force.
  • Figure 5: Recall comparison between KNN and ANNOY, with and without fine-tuning.
  • ...and 2 more figures