On Leveraging Large Language Models for Enhancing Entity Resolution: A Cost-efficient Approach

Huahang Li; Longyu Feng; Shuangyin Li; Fei Hao; Chen Jason Zhang; Yuanfeng Song

TL;DR

The paper addresses cost-efficient entity resolution (ER) by introducing an uncertainty reduction framework that uses LLMs to verify a strategically chosen set of matching questions (MQs). By modeling the possible partitions with probabilities and measuring their uncertainty with entropy, it shows that the expected uncertainty reduction equals the joint entropy of the possible answers. It formulates MQ selection as a budgeted problem, proves it NP-hard, and solves it with a greedy approach based on submodular optimization. It also incorporates error-tolerant updates to handle imperfect LLM responses and a dynamic adjustment mechanism to converge toward the correct partitions. Experiments on real datasets with budgeted LLM querying show improved uncertainty reduction and practical cost savings, indicating that the framework is applicable to scalable ER tasks.

Abstract

Entity resolution, the task of identifying and merging records that refer to the same real-world entity, is crucial in sectors like e-commerce, healthcare, and law enforcement. Large Language Models (LLMs) introduce an innovative approach to this task, capitalizing on their advanced linguistic capabilities and a "pay-as-you-go" model that provides significant advantages to those without extensive data science expertise. However, current LLMs are costly because each API request is billed. Existing methods often either lack quality or become prohibitively expensive at scale. To address these problems, we propose an uncertainty reduction framework that uses LLMs to improve entity resolution results. We first initialize the possible partitions of the records into entity clusters, where the records in each cluster refer to the same entity, and define the uncertainty of the result. Then, we reduce the uncertainty by selecting a few valuable matching questions for LLM verification. Upon receiving the answers, we update the probability distribution of the possible partitions. To further reduce costs, we design an efficient algorithm to judiciously select the most valuable matching pairs to query. Additionally, we create error-tolerant techniques to handle LLM mistakes and a dynamic adjustment method to reach truly correct partitions. Experimental results show that our method is efficient and effective, offering promising applications in real-world tasks.
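
The abstract's core loop (measure uncertainty over the possible partitions, ask the LLM a matching question, update the distribution) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The record names, the prior probabilities, the helper names (entropy, same_cluster, update), and the fixed accuracy parameter used for the error-tolerant update are all assumptions made for illustration; the update rule here is a plain Bayesian reweighting, which may differ from the paper's technique.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a distribution over possible partitions."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def same_cluster(partition, a, b):
    """True if records a and b share a cluster in the given partition."""
    return any(a in cluster and b in cluster for cluster in partition)

def update(partitions, probs, pair, answer, accuracy=0.9):
    """Reweight partition probabilities after the LLM answers a matching question.

    `accuracy` is an assumed probability that the LLM answer is correct;
    partitions that contradict the answer keep a small residual weight.
    """
    a, b = pair
    weights = []
    for part, p in zip(partitions, probs):
        agrees = same_cluster(part, a, b) == answer
        weights.append(p * (accuracy if agrees else 1 - accuracy))
    total = sum(weights)
    return [w / total for w in weights]

# Three hypothetical partitions of records r1, r2, r3 with assumed priors.
partitions = [
    [{"r1", "r2"}, {"r3"}],
    [{"r1"}, {"r2", "r3"}],
    [{"r1"}, {"r2"}, {"r3"}],
]
probs = [0.5, 0.3, 0.2]

# The LLM says "r1 and r2 match"; the first partition gains probability mass.
posterior = update(partitions, probs, ("r1", "r2"), answer=True)
print(entropy(probs), entropy(posterior))
```

Because contradicted partitions are down-weighted rather than eliminated, a later answer can still revive them, which is one simple way to tolerate an occasional wrong LLM response.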

Paper Structure (13 sections, 1 theorem, 20 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Preliminary Knowledge
Uncertainty Reduction Framework
Probability Distribution of Possible Partition
Expected Uncertainty Reduction
Strategy for MQs Selection
Adjustment with LLMs Response
Experiments
Experimental Setup
Evaluation on LLM
Data Quality
Related Works
Conclusion

Key Result

Theorem 1

The MQs selection problem (MQsSP) is NP-hard.
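
Since exact selection is NP-hard, the TL;DR describes a greedy alternative that scores questions by the joint entropy of their possible answers. The sketch below is one plausible reading of that idea, not the paper's algorithm: it assumes error-free answers when computing joint entropy, and the gain-per-cost ranking rule, the function names (joint_entropy, greedy_select), the unit costs, and the example data are all hypothetical.

```python
import math
from collections import defaultdict

def same_cluster(partition, a, b):
    """True if records a and b share a cluster in the given partition."""
    return any(a in c and b in c for c in partition)

def joint_entropy(partitions, probs, questions):
    """Entropy (bits) of the joint answer vector to `questions`.

    Assumes each partition determines every answer with no LLM error.
    """
    mass = defaultdict(float)
    for part, p in zip(partitions, probs):
        key = tuple(same_cluster(part, a, b) for a, b in questions)
        mass[key] += p
    return -sum(p * math.log2(p) for p in mass.values() if p > 0)

def greedy_select(partitions, probs, candidates, costs, budget):
    """Repeatedly add the affordable question with the best entropy gain per cost."""
    chosen, spent = [], 0.0
    remaining = list(candidates)
    while remaining:
        base = joint_entropy(partitions, probs, chosen)
        best, best_ratio = None, 0.0
        for q in remaining:
            if spent + costs[q] > budget:
                continue  # question does not fit in the remaining budget
            gain = joint_entropy(partitions, probs, chosen + [q]) - base
            ratio = gain / costs[q]
            if ratio > best_ratio:
                best, best_ratio = q, ratio
        if best is None:
            break  # nothing affordable adds information
        chosen.append(best)
        spent += costs[best]
        remaining.remove(best)
    return chosen

# Hypothetical example: three partitions of r1, r2, r3 and unit-cost questions.
partitions = [
    [{"r1", "r2"}, {"r3"}],
    [{"r1"}, {"r2", "r3"}],
    [{"r1"}, {"r2"}, {"r3"}],
]
probs = [0.5, 0.3, 0.2]
candidates = [("r1", "r2"), ("r2", "r3"), ("r1", "r3")]
costs = {q: 1.0 for q in candidates}
chosen = greedy_select(partitions, probs, candidates, costs, budget=2.0)
print(chosen)
```

In this toy case the question about r1 and r3 is never selected, because every partition gives the same answer to it and it therefore carries no information.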

Figures (4)

Figure 1: A real case of our uncertainty reduction framework: (a) shows the initial probability distribution of the possible partitions; (b) and (c) show the probability distribution after verifying the MQs chosen by random selection; (d) and (e) show the probability distribution after verifying the MQs chosen by greedy selection. We annotate the uncertainty of each state and the cost used in each step.
Figure 2: The abilities of various LLMs on datasets in different fields. All results are averaged over three runs.
Figure 3: Random Selection vs. Greedy Approximation Selection with 1k and 2k Budget Constraints.
Figure 4: Data Quality with Budget Constraint.

Theorems & Definitions (9)

Definition 1: Possible Partition
Remark
Definition 2: Result Set
Definition 3: Uncertainty of Result
Definition 4: Matching Pair
Definition 5: Matching Question
Definition 6: Cost of MQ
Remark
Theorem 1
