Duplicate Detection with GenAI

Ian Ormesher

TL;DR

This paper tackles the problem of duplicate detection in CRM data by reframing entity matching around embedding-based representations generated from match sentences and using DBSCAN for clustering. It replaces the traditional candidate-generation step with a preprocessing-free workflow that leverages pretrained large language model embeddings ($768$-dimensional) to form clusters via cosine similarity, achieving substantial improvements over baseline NLP approaches on benchmark datasets (up to roughly $60\%$ de-duplication accuracy). Key contributions include a practical, training-free pipeline, language-aware matching, and demonstrating robustness across datasets (Customer Data and Musicbrainz) with visualisation of clustering via UMAP. The approach promises practical impact in data quality, user engagement, and easy adaptability to future, better embedding models.
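The pipeline summarised above (match sentence, embedding, cosine-distance DBSCAN) can be illustrated with a minimal, standard-library-only sketch. Several pieces here are assumptions rather than the paper's implementation: `match_sentence` simply joins chosen field values, `toy_embed` is a character-trigram stand-in for the pretrained $768$-dimensional language model embeddings, the `dbscan` function is a small textbook version, and the record fields and `eps=0.25` are arbitrary toy choices, not the paper's tuned settings.

```python
import math
from collections import Counter

def match_sentence(record, fields):
    """Join the chosen field values into one string (assumed format)."""
    return " ".join(str(record.get(f, "")).strip().lower() for f in fields)

def toy_embed(text, n=3):
    """Hypothetical stand-in for a pretrained embedding model:
    a sparse vector of character n-gram counts."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0

def dbscan(vectors, eps, min_samples=2):
    """Minimal DBSCAN over cosine distance.
    Returns one label per vector; -1 marks noise (no duplicate found)."""
    n = len(vectors)
    # Each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # expand only through core points
    return labels

# Toy records (invented for illustration only).
records = [
    {"name": "Jon Smith", "city": "London"},
    {"name": "John Smith", "city": "London"},
    {"name": "Maria Garcia", "city": "Madrid"},
    {"name": "Wei Chen", "city": "Singapore"},
]
vectors = [toy_embed(match_sentence(r, ["name", "city"])) for r in records]
labels = dbscan(vectors, eps=0.25)
print(labels)  # [0, 0, -1, -1]
```

Records that share a non-negative label form one candidate duplicate group, so the two "Smith" records land together while the others are left as noise. In a realistic setting the toy embedding would be replaced by a pretrained model, and `eps` would need tuning per dataset.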

Abstract

Customer data is often stored as records in Customer Relationship Management systems (CRMs). Data which is manually entered into such systems by one or more users over time leads to data replication, partial duplication or fuzzy duplication. This in turn means that there is no longer a single source of truth for customers, contacts, accounts, etc. Downstream business processes become increasingly complex and contrived without a unique mapping between a record in a CRM and the target customer. Current methods to detect and de-duplicate records use traditional Natural Language Processing techniques known as Entity Matching. In this paper we show how using the latest advancements in Large Language Models and Generative AI can vastly improve the identification and repair of duplicated records. On common benchmark datasets we find an improvement in the accuracy of data de-duplication rates from 30 percent using NLP techniques to almost 60 percent using our proposed method.

Paper Structure (27 sections, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: The five steps of a typical duplicate detection pipeline based on pairwise record comparisons
  • Figure 2: The steps to creating the embedding vector
  • Figure 3: Experimental results epsilon against F-score
  • Figure 4: Match Groups for the Musicbrainz 200K dataset (epsilon=0.245)
  • Figure 5: 2D UMAP Musicbrainz 200K nearest neighbour plot
  • ...and 1 more figure