Disambiguate Entity Matching using Large Language Models through Relation Discovery

Zezhou Huang

TL;DR

The paper tackles the ambiguity in entity matching (EM) when integrating with external databases that vary in granularity. It introduces a relation-based EM framework that predefines a set of relations (e.g., Exactly the same, General without details, Similar with Additional Details) to guide matching, rather than relying solely on semantic similarity. The approach combines an offline phase (entity embeddings, e.g., from ada-002, stored in a Faiss index) with online Retrieval Augmented Generation, in which an LLM analyzes candidate matches under each relation using chain-of-thought reasoning and a human in the loop (HIL) makes the final decisions. Demonstrated on ESG reporting tasks, the method reduces manual effort and improves interpretability, while revealing domain-knowledge gaps that motivate iterative refinement of relations and better HIL design for high-stakes downstream tasks.

Abstract

Entity matching is a critical challenge in data integration and cleaning, central to tasks like fuzzy joins and deduplication. Traditional approaches have focused on overcoming fuzzy term representations through methods such as edit distance, Jaccard similarity, and more recently, embeddings and deep neural networks, including advancements from large language models (LLMs) like GPT. However, the core challenge in entity matching extends beyond term fuzziness to the ambiguity in defining what constitutes a "match," especially when integrating with external databases. This ambiguity arises due to varying levels of detail and granularity among entities, complicating exact matches. We propose a novel approach that shifts focus from purely identifying semantic similarities to understanding and defining the "relations" between entities as crucial for resolving ambiguities in matching. By predefining a set of relations relevant to the task at hand, our method allows analysts to navigate the spectrum of similarity more effectively, from exact matches to conceptually related entities.
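The two-phase pipeline summarized in the TL;DR (offline embedding index, then online retrieval and per-relation LLM analysis) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_embed` is a character-trigram stand-in for a real embedding model such as ada-002, a plain Python list stands in for the Faiss index, the prompt wording is an assumption, and the entity names are hypothetical. Only the three relation names come from the text above. The sketch stops at building the prompts; sending them to an LLM and the human review step are left out.

```python
import math
from collections import Counter

# Relation names taken from the TL;DR; everything else here is illustrative.
RELATIONS = [
    "Exactly the same",
    "General without details",
    "Similar with Additional Details",
]


def toy_embed(text):
    """Stand-in embedding: character-trigram counts (NOT ada-002)."""
    s = f"  {text.lower()}  "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(entities):
    """Offline phase: embed each external entity once (stand-in for Faiss)."""
    return [(name, toy_embed(name)) for name in entities]


def retrieve(index, query, k=3):
    """Online phase, step 1: top-k candidates by cosine similarity."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


def build_prompt(query, candidates, relation):
    """Online phase, step 2: one prompt per relation (hypothetical wording)."""
    listing = "\n".join(f"- {c}" for c in candidates)
    return (
        f"Query entity: {query}\n"
        f"Candidates:\n{listing}\n"
        f"Relation: {relation}\n"
        "Think step by step, then list the candidates that hold this relation "
        "to the query, with a short explanation for each."
    )


if __name__ == "__main__":
    # Hypothetical external database entries.
    external = ["Electricity", "Electricity, wind power", "Natural gas", "Diesel fuel"]
    index = build_index(external)
    query = "Electricity from wind"
    candidates = retrieve(index, query, k=2)
    for relation in RELATIONS:
        print(build_prompt(query, candidates, relation))
        print("---")
```

Building one prompt per relation, rather than asking for a single "match" verdict, mirrors the idea that the analyst chooses which relation counts as a match for the task at hand.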

Paper Structure (8 sections, 2 equations, 3 figures)

This paper contains 8 sections, 2 equations, 3 figures.

Introduction
Approach Overview
Problem Definition
System Design and Usage Walkthrough
Offline
Online Phase
User Study
Conclusion

Figures (3)

Figure 1: Entity Matching for ESG emission factor.
Figure 2: System design for relation-based entity matching in high-stakes tasks such as ESG reporting.
Figure 3: Generated report detailing the matched entities with respect to various relations, and their explanations, used by humans to perform downstream high-stakes tasks.

Theorems & Definitions (4)

Example 1
Example 2
Example 3
Example 4
