Disambiguate Entity Matching using Large Language Models through Relation Discovery
Zezhou Huang
TL;DR
The paper tackles the ambiguity that arises in entity matching (EM) when integrating with external databases whose entries vary in granularity. It introduces a relation-based EM framework in which analysts predefine a set of relations (e.g., Exactly the same, General without details, Similar with Additional Details) to guide matching, rather than relying solely on semantic similarity. The approach combines offline embedding (e.g., using ada-002 and a Faiss index) with online Retrieval Augmented Generation, where an LLM analyzes candidate matches under each relation, aided by chain-of-thought reasoning, and a human in the loop (HIL) makes the final decisions. Demonstrated on ESG reporting tasks, the method reduces manual effort and improves interpretability, while revealing domain-knowledge gaps that motivate iterative refinement of relations and better HIL design for high-stakes downstream tasks.
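The two-stage flow described above (offline indexing, then online retrieval plus per-relation LLM analysis) can be illustrated with a minimal Python sketch. This is not the paper's implementation. To stay dependency-free, it assumes a toy character-trigram embedding in place of ada-002 and a brute-force cosine search in place of a Faiss index. It only builds the prompt text and does not call an LLM. The relation descriptions, the function names (`embed`, `build_index`, `retrieve`, `build_prompt`), the prompt wording, and the sample entities are all hypothetical.

```python
import math
from collections import Counter

# Relation names come from the text; the descriptions are illustrative assumptions.
RELATIONS = {
    "Exactly the same": "both entries denote the same entity at the same level of detail",
    "General without details": "the candidate is broader and omits details present in the query",
    "Similar with Additional Details": "the candidate is similar but adds details absent from the query",
}

def embed(text, n=3):
    """Toy embedding: character n-gram counts (assumed stand-in for a real model)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(entities):
    """Offline step: embed every external-database entity once."""
    return [(entity, embed(entity)) for entity in entities]

def retrieve(index, query, k=2):
    """Online step 1: top-k candidates by similarity (brute force, not Faiss)."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [entity for entity, _ in ranked[:k]]

def build_prompt(query, candidates, relation):
    """Online step 2: prompt asking an LLM to judge candidates under one relation."""
    listing = "\n".join(f"- {c}" for c in candidates)
    return (
        f"Query entity: {query}\n"
        f"Relation: {relation} ({RELATIONS[relation]})\n"
        f"Candidates:\n{listing}\n"
        "Think step by step, then list the candidates that satisfy the relation."
    )

# Hypothetical external entries, chosen only for illustration.
external = ["Diesel fuel", "Diesel fuel (biodiesel blend B20)", "Natural gas", "Electricity from grid"]
index = build_index(external)
candidates = retrieve(index, "diesel", k=2)
prompts = {rel: build_prompt("diesel", candidates, rel) for rel in RELATIONS}
```

In this sketch each relation gets its own prompt, so the responses could be shown to an analyst side by side for the final decision. In a real deployment, the prompt format and the way results are presented for review would follow the paper's own design rather than this illustration.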
Abstract
Entity matching is a critical challenge in data integration and cleaning, central to tasks like fuzzy joins and deduplication. Traditional approaches have focused on overcoming fuzzy term representations through methods such as edit distance, Jaccard similarity, and more recently, embeddings and deep neural networks, including advancements from large language models (LLMs) like GPT. However, the core challenge in entity matching extends beyond term fuzziness to the ambiguity in defining what constitutes a "match," especially when integrating with external databases. This ambiguity arises due to varying levels of detail and granularity among entities, complicating exact matches. We propose a novel approach that shifts focus from purely identifying semantic similarities to understanding and defining the "relations" between entities as crucial for resolving ambiguities in matching. By predefining a set of relations relevant to the task at hand, our method allows analysts to navigate the spectrum of similarity more effectively, from exact matches to conceptually related entities.
