Duplicate Detection with GenAI

Ian Ormesher

TL;DR

This paper tackles duplicate detection in CRM data by reframing entity matching around embeddings generated from match sentences, which are then clustered with DBSCAN. It replaces the traditional candidate-generation step with a workflow that needs no preprocessing: pretrained large language model embeddings ($768$-dimensional) are grouped into clusters by cosine similarity, raising de-duplication accuracy on benchmark datasets from about $30\%$ with baseline NLP approaches to almost $60\%$. Key contributions include a practical, training-free pipeline, language-aware matching, and a demonstration of robustness across two datasets (Customer Data and Musicbrainz), with the clustering visualised using UMAP. The approach promises practical impact on data quality and user engagement, and adapts easily to future, better embedding models.
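To make the pipeline concrete, here is a minimal sketch of its shape, not the paper's implementation. It makes several assumptions: `match_sentence` and its space-joined format are hypothetical, the three-dimensional `toy_vectors` stand in for the $768$-dimensional model embeddings, and `eps=0.05` and `min_samples=2` are arbitrary illustrative values. DBSCAN is written out in plain Python with cosine distance so that the example runs without any third-party library.

```python
import math


def match_sentence(record, fields):
    """Join the chosen attribute values into one string (hypothetical format)."""
    return " ".join(str(record[f]).strip() for f in fields if record.get(f))


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dbscan(vectors, eps, min_samples=2):
    """Minimal DBSCAN; returns one label per vector, with -1 meaning noise."""
    n = len(vectors)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Hypothetical record; in the real pipeline the sentence would be embedded.
sentence = match_sentence(
    {"name": "Ian Smith", "city": " Leeds ", "phone": ""},
    ["name", "city", "phone"],
)

# Stand-in "embeddings": two near-duplicate pairs and one isolated record.
toy_vectors = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.1],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]
labels = dbscan(toy_vectors, eps=0.05)
print(sentence)  # Ian Smith Leeds
print(labels)    # [0, 0, 1, 1, -1]
```

Records that share a label form one candidate duplicate group, and a label of -1 marks a record with no near neighbour. In practice a library implementation of DBSCAN with a cosine metric would replace the hand-written loop, and the distance threshold would be tuned per dataset.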

Abstract

Customer data is often stored as records in Customer Relationship Management systems (CRMs). Data which is manually entered into such systems by one or more users over time leads to data replication, partial duplication or fuzzy duplication. This in turn means that there is no longer a single source of truth for customers, contacts, accounts, etc. Downstream business processes become increasingly complex and contrived without a unique mapping between a record in a CRM and the target customer. Current methods to detect and de-duplicate records use traditional Natural Language Processing techniques known as Entity Matching. In this paper we show how using the latest advancements in Large Language Models and Generative AI can vastly improve the identification and repair of duplicated records. On common benchmark datasets we find an improvement in the accuracy of data de-duplication rates from 30 percent using NLP techniques to almost 60 percent using our proposed method.

Paper Structure (27 sections, 6 figures, 6 tables, 1 algorithm)

Introduction
Traditional Approach
Candidate Generation
Blocking
Matching
Clustering
Proposed Method
Create Match Sentences
Create Embedding Vectors
Clustering
Experiments
Customer Data Experiments
Results
Musicbrainz Experiments
Match Sentence
...and 12 more sections

Figures (6)

Figure 1: The five steps of a typical duplicate detection pipeline based on pairwise record comparisons
Figure 2: The steps to creating the embedding vector
Figure 3: Experimental results epsilon against F-score
Figure 4: Match Groups for the Musicbrainz 200K dataset (epsilon=0.245)
Figure 5: 2D UMAP Musicbrainz 200K nearest neighbour plot
...and 1 more figure
