BoostER: Leveraging Large Language Models for Enhancing Entity Resolution
Huahang Li, Shuangyin Li, Fei Hao, Chen Jason Zhang, Yuanfeng Song, Lei Chen
TL;DR
The paper addresses the cost-sensitive challenge of entity resolution (ER) on noisy Web data by using Large Language Models as a service to verify and refine candidate links. It proposes BoostER, which builds a probabilistic partition over candidate links from base ER results and uses a token-aware, entropy-driven greedy algorithm to select LLM-verification questions under a budget; the greedy selection carries an approximation ratio of $1-1/e$. LLM responses are incorporated via Bayesian updates with an estimated LLM capability $\Theta$, which reduces uncertainty (measured by the entropy $H$ and joint entropy $D_A$) and yields more precise partitions. The demonstration shows practical usability for small-scale users and outlines future improvements in prompting strategies, highlighting a cost-effective path to higher-quality ER without extensive model training.
Abstract
Entity resolution, which involves identifying and merging records that refer to the same real-world entity, is a crucial task in areas like Web data integration. Its importance is underscored by the many duplicated and multi-version data resources on the Web. However, achieving high-quality entity resolution typically demands significant effort. Large Language Models (LLMs) like GPT-4 have demonstrated advanced linguistic capabilities, which may offer a new paradigm for this task. In this paper, we propose a demonstration system named BoostER that examines the possibility of leveraging LLMs in the entity resolution process, revealing advantages in both easy deployment and low cost. Our approach optimally selects a set of matching questions and poses them to LLMs for verification, then refines the distribution of entity resolution results using the LLMs' responses. This offers promising prospects for achieving high-quality entity resolution results in real-world applications, especially for individuals or small companies, without the need for extensive model training or significant financial investment.
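
The select-verify-refine loop described above can be illustrated with a minimal Python sketch. It is not the BoostER implementation, and everything in it is an assumption made for illustration: candidate ER results are modeled as a dictionary of "possible worlds" (each a set of matched record pairs with a prior probability), the LLM is assumed to answer a yes/no matching question correctly with probability `theta`, and the hypothetical helper `select_questions` ranks questions by single-question expected entropy reduction per token. That ranking is a simplification that ignores the joint entropy between questions.

```python
import math


def entropy(dist):
    """Shannon entropy (in bits) of a dict mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def bayes_update(worlds, pair, answer, theta):
    """Posterior over possible worlds after the LLM answers one question.

    worlds: dict mapping frozenset of matched pairs -> prior probability.
    pair:   the record pair asked about.
    answer: True if the LLM said "match", False otherwise.
    theta:  assumed probability that the LLM answers correctly.
    """
    unnormalized = {}
    for matches, prior in worlds.items():
        agrees = (pair in matches) == answer
        likelihood = theta if agrees else 1.0 - theta
        unnormalized[matches] = prior * likelihood
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("answer has zero probability under the prior")
    return {w: p / total for w, p in unnormalized.items()}


def expected_gain(worlds, pair, theta):
    """Expected entropy reduction from asking the LLM about `pair`."""
    p_yes = sum(
        prior * (theta if pair in matches else 1.0 - theta)
        for matches, prior in worlds.items()
    )
    expected_posterior = 0.0
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            posterior = bayes_update(worlds, pair, answer, theta)
            expected_posterior += p_answer * entropy(posterior)
    return entropy(worlds) - expected_posterior


def select_questions(worlds, token_costs, budget, theta):
    """Pick questions by expected gain per token until the budget is spent.

    Simplification: each question is scored on its own, so redundancy
    between questions (their joint entropy) is not accounted for.
    """
    ranked = sorted(
        token_costs,
        key=lambda q: expected_gain(worlds, q, theta) / token_costs[q],
        reverse=True,
    )
    chosen, spent = [], 0
    for q in ranked:
        if spent + token_costs[q] <= budget:
            chosen.append(q)
            spent += token_costs[q]
    return chosen


# Hypothetical toy data: three possible ER results over records r1..r3.
worlds = {
    frozenset({("r1", "r2")}): 0.6,
    frozenset({("r2", "r3")}): 0.3,
    frozenset(): 0.1,
}
token_costs = {("r1", "r2"): 40, ("r2", "r3"): 50}  # assumed prompt sizes

questions = select_questions(worlds, token_costs, budget=60, theta=0.9)
posterior = bayes_update(worlds, ("r1", "r2"), answer=True, theta=0.9)
```

In this toy setup, a "match" answer for `("r1", "r2")` shifts probability mass toward the world containing that pair. How much it shifts depends on `theta`: with `theta = 0.5` the answer carries no information and the distribution is unchanged.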
