Duplicate Detection with GenAI
Ian Ormesher
TL;DR
This paper tackles duplicate detection in CRM data by reframing entity matching around embeddings generated from match sentences, which are then clustered with DBSCAN. It replaces the traditional candidate-generation step with a preprocessing-free workflow that uses pretrained large language model embeddings ($768$-dimensional) and forms clusters by cosine similarity, raising de-duplication accuracy on benchmark datasets from about $30\%$ with baseline NLP approaches to almost $60\%$. Key contributions include a practical, training-free pipeline, language-aware matching, and a demonstration of robustness across two datasets (Customer Data and Musicbrainz), with the clustering visualised via UMAP. The approach promises practical impact on data quality and user engagement, and it adapts easily to future, better embedding models.
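The pipeline summarised above (match sentence, embedding, cosine-based DBSCAN clustering) can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the records, the match-sentence format, and the values of `eps` and `min_samples` are hypothetical, and `toy_embed` is a character-trigram stand-in for the pretrained language model embeddings the paper uses. The small DBSCAN routine is a plain textbook version, not the author's implementation.

```python
import math
from collections import Counter

# Hypothetical CRM records; the field names and values are illustrative only.
records = [
    {"name": "Jon Smith", "company": "Acme Ltd", "city": "London"},
    {"name": "John Smith", "company": "Acme Ltd.", "city": "London"},
    {"name": "Maria Garcia", "company": "Globex", "city": "Madrid"},
    {"name": "Maria Garcia", "company": "Globex Corp", "city": "Madrid"},
    {"name": "Wei Zhang", "company": "Initech", "city": "Singapore"},
]

def match_sentence(record):
    """Join the field values into one lower-cased string (assumed format)."""
    return " ".join(str(value) for value in record.values()).lower()

def toy_embed(text):
    """Stand-in for an LLM embedding: sparse character-trigram counts."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * \
        math.sqrt(sum(c * c for c in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def dbscan(vectors, eps, min_samples):
    """Textbook DBSCAN; returns one label per vector, -1 meaning noise."""
    n = len(vectors)
    labels = [None] * n  # None = not yet visited

    def neighbours(i):
        # A point counts as its own neighbour.
        return [j for j in range(n)
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: join, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:  # core point: keep expanding
                queue.extend(reachable)
    return labels

vectors = [toy_embed(match_sentence(r)) for r in records]
# eps and min_samples are illustrative choices, not values from the paper.
labels = dbscan(vectors, eps=0.3, min_samples=2)
print(labels)  # [0, 0, 1, 1, -1]
```

Records that share a non-negative label are candidate duplicates of one another, and a label of -1 marks a record with no near neighbour. In a realistic setting `toy_embed` would be replaced by a call to a pretrained embedding model, and a library DBSCAN with a cosine metric would likely be preferable to this hand-written loop.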
Abstract
Customer data is often stored as records in Customer Relationship Management systems (CRMs). Data which is manually entered into such systems by one or more users over time leads to data replication, partial duplication or fuzzy duplication. This in turn means that there is no longer a single source of truth for customers, contacts, accounts, etc. Downstream business processes become increasingly complex and contrived without a unique mapping between a record in a CRM and the target customer. Current methods to detect and de-duplicate records use traditional Natural Language Processing techniques known as Entity Matching. In this paper we show how the latest advancements in Large Language Models and Generative AI can vastly improve the identification and repair of duplicated records. On common benchmark datasets we find that de-duplication accuracy improves from 30 percent using NLP techniques to almost 60 percent using our proposed method.
