Learning from Natural Language Explanations for Generalizable Entity Matching

Somin Wadhwa, Adit Krishnan, Runhui Wang, Byron C. Wallace, Chris Kong

TL;DR

The paper tackles the generalization gap in entity matching by reframing the task as conditional text generation and distilling the reasoning of large language models into compact seq2seq models. It demonstrates that training small models with explanation-augmented data (explanations generated by LLMs) substantially improves out-of-domain performance across cross-domain, cross-schema, and cross-distribution settings, outperforming traditional domain-adaptation approaches in several cases. Comprehensive ablations show that the content and quality of explanations matter for robustness and performance, though some factuality and intrinsic error challenges remain. The approach offers a scalable, cost-efficient path to robust entity matching across diverse domains, with practical implications for real-world data integration tasks and privacy considerations when using external LLMs for training data. Overall, explanation-guided distillation emerges as a promising strategy to achieve strong generalization without relying on expensive inference from giant LLMs at deployment.

Abstract

Entity matching is the task of linking records from different sources that refer to the same real-world entity. Past work has primarily treated entity matching as a standard supervised learning problem. However, supervised entity matching models often do not generalize well to new data, and collecting exhaustive labeled training data is often cost-prohibitive. Further, recent efforts have adopted LLMs for this task in few/zero-shot settings, exploiting their general knowledge. But LLMs are prohibitively expensive for performing inference at scale for real-world entity matching tasks. As an efficient alternative, we re-cast entity matching as a conditional generation task as opposed to binary classification. This enables us to "distill" LLM reasoning into smaller entity matching models via natural language explanations. This approach achieves strong performance, especially on out-of-domain generalization tests (10.85% F-1) where standalone generative methods struggle. We perform ablations that highlight the importance of explanations, both for performance and model robustness.
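
To make the reframing concrete, here is a minimal sketch of how a labeled record pair might be turned into a source/target string pair for a seq2seq model, with an optional LLM-written explanation appended to the target. The prompt wording, the `PairExample` type, the field separator, and the "match" / "no match" label strings are all assumptions made for illustration; the abstract does not specify the authors' actual template.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PairExample:
    """One candidate pair. All field names here are hypothetical."""
    left: dict
    right: dict
    is_match: bool
    explanation: Optional[str] = None  # LLM-written rationale, if available


def serialize_record(record: dict) -> str:
    # Flatten attribute/value pairs into one string (assumed layout).
    return " | ".join(f"{key}: {value}" for key, value in record.items())


def to_seq2seq(example: PairExample) -> Tuple[str, str]:
    """Build (source, target) text for conditional generation."""
    source = (
        "Do these two records refer to the same entity? "
        f"Record A: {serialize_record(example.left)} "
        f"Record B: {serialize_record(example.right)}"
    )
    label = "match" if example.is_match else "no match"
    if example.explanation is None:
        # Binary-labeled (BL) target: the label only.
        target = label
    else:
        # Explanation-augmented (EA) target: label followed by the rationale.
        target = f"{label}. Explanation: {example.explanation}"
    return source, target


def parse_label(generated: str) -> bool:
    # Recover the binary decision from generated text.
    return generated.strip().lower().startswith("match")
```

With targets of this shape, the binary decision is read off the start of the generated text, so the fine-tuned model can still be scored with standard classification metrics such as F1.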

Paper Structure

This paper contains 34 sections, 1 equation, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: An example of the generalization problem in entity matching: A model trained on a dataset of computers (e.g., WDC-Computers) is tested on instances taken from a corpus comprising shoes (WDC-Shoes).
  • Figure 2: We propose augmenting binary labeled (BL) training data of entity matching datasets with Chain-of-Thought style natural language explanations from large models before fine-tuning smaller, more robust generative models. As a proxy for cost, we use the time needed to generate explanation-augmented (EA) training data on a typical Amazon EC2 P3 instance for the Mistral (Jiang et al., 2023) and Alpaca (Taori et al., 2023) models, and the total cost of OpenAI's API usage for the GPT-* models. Using this approach, we realize significant performance gains in a variety of out-of-domain test settings.
  • Figure 3: Average F1 on out-of-domain test data when training data is ablated under varying conditions.
  • Figure 4: Interface to conduct Test of Factuality annotations on instances taken from the Abt-Buy dataset. Each model-generated explanation (Mistral-7B; Jiang et al., 2023) is tested for intrinsic and extrinsic errors.