Entity Matching using Large Language Models

Ralph Peeters, Aaron Steiner, Christian Bizer

TL;DR

Generative LLMs are evaluated as robust, data-efficient alternatives to PLMs for entity matching across zero-shot and data-assisted settings, using both hosted and open-source models. The study shows GPT-4 often achieves top zero-shot performance on product data, while open-source LLMs can approach this with appropriate prompts, demonstrations, or rules; fine-tuning also yields significant gains for several models. It further demonstrates that GPT-4 can generate structured explanations for matching decisions, enabling aggregation to derive system-wide insights, and that GPT-4-turbo can automatically identify and categorize error classes for automated error analysis. These findings support practical guidance on when to use PLMs versus LLMs, how to design prompts, and how to leverage explanations to improve entity matching pipelines under privacy and cost considerations.

Abstract

Entity matching is the task of deciding whether two entity descriptions refer to the same real-world entity, and it is a central step in most data integration pipelines. Many state-of-the-art entity matching methods rely on pre-trained language models (PLMs) such as BERT or RoBERTa. Two major drawbacks of these models for entity matching are that (i) they require significant amounts of task-specific training data and (ii) the fine-tuned models are not robust to out-of-distribution entities. This paper investigates generative large language models (LLMs) as a more robust alternative to PLM-based matchers that depends less on task-specific training data. The study covers hosted LLMs as well as open-source LLMs that can be run locally. We evaluate these models in a zero-shot scenario and in a scenario where task-specific training data is available. We compare different prompt designs and the prompt sensitivity of the models. We show that there is no single best prompt; instead, the prompt needs to be tuned for each model/dataset combination. We further investigate (i) the selection of in-context demonstrations, (ii) the generation of matching rules, and (iii) fine-tuning LLMs, all using the same pool of training data. Our experiments show that the best LLMs require no or only a few training examples to perform comparably to PLMs that were fine-tuned using thousands of examples. LLM-based matchers further exhibit higher robustness to unseen entities. We show that GPT-4 can generate structured explanations for matching decisions and can automatically identify potential causes of matching errors by analyzing explanations of wrong decisions. We demonstrate that the model can generate meaningful textual descriptions of the identified error classes, which can help data engineers improve entity matching pipelines.
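
To make the zero-shot setting concrete, the following is a minimal sketch of how an entity pair might be turned into a matching prompt and how a free-text model answer might be mapped back to a binary decision. The prompt wording, the function names (`serialize`, `build_zero_shot_prompt`, `parse_decision`), and the attribute names are assumptions chosen for illustration, not the prompts or code used in the paper. No model is called; the sketch only covers the steps around the model call.

```python
def serialize(entity: dict) -> str:
    """Flatten an entity description into one 'attr: value' string,
    skipping attributes with empty values."""
    return ", ".join(f"{k}: {v}" for k, v in entity.items() if v)


def build_zero_shot_prompt(left: dict, right: dict) -> str:
    """Build a zero-shot matching prompt.

    The wording is hypothetical; the abstract notes that the best prompt
    differs per model/dataset combination, so treat this as a placeholder.
    """
    return (
        "Do the two entity descriptions refer to the same real-world entity? "
        "Answer with 'Yes' or 'No'.\n"
        f"Entity 1: {serialize(left)}\n"
        f"Entity 2: {serialize(right)}"
    )


def parse_decision(answer: str) -> bool:
    """Map a free-text model answer to match (True) / non-match (False).

    Assumes the answer begins with 'Yes' or 'No'; anything else counts
    as a non-match.
    """
    return answer.strip().lower().startswith("yes")
```

The returned prompt string would be sent to whichever hosted or local model is being evaluated, and the raw answer passed to `parse_decision`. Treating every answer that does not start with "yes" as a non-match is a simplification made for this sketch.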

Paper Structure

This paper contains 16 sections, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Example of prompting an LLM to match two entity descriptions.
  • Figure 2: Example of a prompt containing a positive and a negative demonstration before asking for a decision.
  • Figure 3: Example of a prompt containing handwritten matching rules for the product domain. A subset of the learned rules is depicted below.
  • Figure 4: Conversation instructing the model to match an entity pair and asking for a structured explanation of the decision. Top: Walmart-Amazon, bottom: DBLP-Scholar.
  • Figure 5: Prompt used for the automatic generation of error classes given false positives and false negatives.
  • ...and 1 more figure
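
The prompt layout described for Figure 2, with labelled demonstrations placed before the pair to be decided, could be assembled roughly as follows. This is a sketch under stated assumptions: the question text, the `Answer:` label format, and the names `Demo`, `format_pair`, and `build_few_shot_prompt` are hypothetical and are not taken from the paper. How the demonstrations are selected is left outside the sketch.

```python
from typing import List, Tuple

# (entity 1, entity 2, is_match): already-serialized descriptions plus a label.
Demo = Tuple[str, str, bool]

# Hypothetical task description; the paper's exact wording is not reproduced here.
QUESTION = "Do the two entity descriptions refer to the same real-world entity?"


def format_pair(left: str, right: str) -> str:
    """Render the question together with one entity pair."""
    return f"{QUESTION}\nEntity 1: {left}\nEntity 2: {right}"


def build_few_shot_prompt(demos: List[Demo], left: str, right: str) -> str:
    """Prepend labelled demonstrations, then ask about the unlabelled pair."""
    blocks = [
        f"{format_pair(l, r)}\nAnswer: {'Yes' if is_match else 'No'}"
        for l, r, is_match in demos
    ]
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"{format_pair(left, right)}\nAnswer:")
    return "\n\n".join(blocks)
```

Passing one positive and one negative demonstration reproduces the layout the caption describes: two answered examples followed by an open question.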