BoostER: Leveraging Large Language Models for Enhancing Entity Resolution

Huahang Li, Shuangyin Li, Fei Hao, Chen Jason Zhang, Yuanfeng Song, Lei Chen

TL;DR

The paper addresses cost-sensitive entity resolution (ER) on noisy Web data by using Large Language Models as a service to verify and refine candidate links. It proposes BoostER, which builds a probability distribution over possible partitions of candidate links from the results of a base ER method, then uses a token-aware, entropy-driven greedy algorithm to select verification questions for the LLM under a budget. LLM responses are incorporated through Bayesian updates that account for an estimated LLM capability $\Theta$, which reduces the entropy of the distribution and yields more precise partitions; the selection is formulated in terms of entropy $H$ and joint entropy $D_A$, with an approximation ratio of $1-1/e$. The demonstration shows that the system is usable by small-scale users and outlines future improvements in prompting strategies, pointing to a cost-effective path to higher-quality ER without extensive model training.
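To make the update-and-select loop described above more concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes that each possible partition is represented as a set of link labels with a probability, that the LLM capability can be modelled as a single number `theta` (the probability that an answer is correct), and that each question has a known token cost. The function names (`bayes_update`, `expected_entropy_after`, `greedy_select`) are hypothetical, and the greedy step is a simplification that scores each question independently by expected entropy reduction per token rather than working with a joint entropy over question sets.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def bayes_update(partitions, question, answer, theta):
    """Reweight partitions after the LLM answers one matching question.

    partitions: list of (frozenset_of_links, probability)
    question:   the link being verified
    answer:     True if the LLM says the records match
    theta:      ASSUMED probability that the LLM answer is correct
    """
    weighted = []
    for links, prob in partitions:
        consistent = (question in links) == answer
        likelihood = theta if consistent else 1.0 - theta
        weighted.append((links, prob * likelihood))
    total = sum(p for _, p in weighted)  # caller must ensure total > 0
    return [(links, p / total) for links, p in weighted]


def expected_entropy_after(partitions, question, theta):
    """Expected entropy of the partition distribution after asking `question`."""
    p_link = sum(p for links, p in partitions if question in links)
    p_yes = theta * p_link + (1.0 - theta) * (1.0 - p_link)
    expected = 0.0
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            posterior = bayes_update(partitions, question, answer, theta)
            expected += p_answer * entropy([p for _, p in posterior])
    return expected


def greedy_select(partitions, token_cost, budget, theta):
    """Pick questions by expected entropy reduction per token, within budget.

    Simplification: each question is scored on its own against the prior,
    so interactions between selected questions are ignored.
    """
    prior_h = entropy([p for _, p in partitions])
    scored = []
    for question, cost in token_cost.items():
        gain = prior_h - expected_entropy_after(partitions, question, theta)
        scored.append((gain / cost, question))
    chosen, spent = [], 0
    for _, question in sorted(scored, reverse=True):
        if spent + token_cost[question] <= budget:
            chosen.append(question)
            spent += token_cost[question]
    return chosen


# Toy data (made up for illustration): three possible partitions over links "a" and "b".
partitions = [
    (frozenset({"a", "b"}), 0.5),
    (frozenset({"a"}), 0.3),
    (frozenset(), 0.2),
]
questions = greedy_select(partitions, {"a": 40, "b": 25}, budget=50, theta=0.9)
posterior = bayes_update(partitions, "a", True, theta=0.9)
```

In this toy setup a "yes" answer on link `a` shifts probability away from the partition that does not contain `a`, and the selector prefers the question with the larger expected entropy reduction per token.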

Abstract

Entity resolution, which involves identifying and merging records that refer to the same real-world entity, is a crucial task in areas like Web data integration. This importance is underscored by the presence of numerous duplicated and multi-version data resources on the Web. However, achieving high-quality entity resolution typically demands significant effort. The advent of Large Language Models (LLMs) like GPT-4 has demonstrated advanced linguistic capabilities, which can be a new paradigm for this task. In this paper, we propose a demonstration system named BoostER that examines the possibility of leveraging LLMs in the entity resolution process, revealing advantages in both easy deployment and low cost. Our approach optimally selects a set of matching questions and poses them to LLMs for verification, then refines the distribution of entity resolution results with the response of LLMs. This offers promising prospects to achieve a high-quality entity resolution result for real-world applications, especially to individuals or small companies without the need for extensive model training or significant financial investment.

Paper Structure (7 sections, 2 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: An illustration of Possible Matches (linkages) in Table \ref{tab:partitions}. The Probability of each linkage is the cumulative sum of its occurrences across Possible Partitions.
  • Figure 2: The Workflow of BoostER.
  • Figure 3: BoostER Demo
  • Figure 4:
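
The Figure 1 caption describes how linkage probabilities are derived from the possible partitions. A small Python sketch of that computation follows, under the assumption that "cumulative sum of its occurrences" means summing the probabilities of every possible partition that contains the linkage. The function name `linkage_probabilities` and the toy data are illustrative and do not come from the paper.

```python
from collections import defaultdict


def linkage_probabilities(partitions):
    """Sum partition probabilities for every linkage that appears in them.

    partitions: list of (iterable_of_links, probability)
    Returns a dict mapping each linkage to its marginal probability
    (ASSUMED reading of the Figure 1 caption).
    """
    totals = defaultdict(float)
    for links, prob in partitions:
        for link in links:
            totals[link] += prob
    return dict(totals)


# Toy data (made up): linkage "a" occurs in two partitions, "b" in one.
example = [
    ({"a", "b"}, 0.5),
    ({"a"}, 0.3),
    (set(), 0.2),
]
marginals = linkage_probabilities(example)
```

Here linkage `a` accumulates the probability of both partitions that contain it, while `b` only receives the probability of the first.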