AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model

Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter

TL;DR

This paper addresses zero-shot entity matching by introducing AnyMatch, a compact GPT-2–based model fine-tuned via transfer learning on schema-agnostic data generated from multiple transfer datasets. The approach relies on a data-generation pipeline that combines AutoML-driven selection of difficult pairs, attribute-level augmentation, and label-imbalance control, paired with a serialization format that requires no column names or types. Empirically, AnyMatch achieves the second-best average F1 across nine benchmarks spanning diverse domains, at a 3,899 times lower inference cost than MatchGPT with the trillion-parameter GPT-4 model. The work demonstrates that a cost-efficient small model can approach state-of-the-art EM accuracy, enabling scalable deployment in data integration and deduplication workflows, with potential for hybrid use alongside LLMs in the future.
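
To make the idea of a serialization format "that requires no column names or types" concrete, here is a minimal sketch. It is not the paper's exact format: the separator, the `Record A` / `Record B` prompt wording, and the function names `serialize_record` and `serialize_pair` are all assumptions chosen for illustration. The only property taken from the summary above is that attribute values are kept while column names and types are dropped.

```python
from typing import Any, Mapping


def serialize_record(record: Mapping[str, Any], sep: str = " | ") -> str:
    """Join the attribute values of a record, ignoring column names and types.

    The separator is a hypothetical choice, not the paper's format.
    """
    values = []
    for value in record.values():
        if value is None:
            continue  # skip missing attributes
        text = str(value).strip()
        if text:
            values.append(text)
    return sep.join(values)


def serialize_pair(left: Mapping[str, Any], right: Mapping[str, Any]) -> str:
    """Build one text input for a pair of records (hypothetical prompt wording)."""
    return (
        f"Record A: {serialize_record(left)}\n"
        f"Record B: {serialize_record(right)}\n"
        "Same entity?"
    )


if __name__ == "__main__":
    a = {"title": "iPhone 13", "price": 799, "brand": None}
    b = {"name": "Apple iPhone 13", "cost": "799.00"}
    print(serialize_pair(a, b))
```

Because only the values are serialized, the two records in the usage example can come from tables with entirely different schemas, which is the property that lets one fine-tuned model be applied to unseen target datasets.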

Abstract

Entity matching (EM) is the problem of determining whether two records refer to the same real-world entity, which is crucial in data integration, e.g., for product catalogs or address databases. A major drawback of many EM approaches is their dependence on labelled examples. We thus focus on the challenging setting of zero-shot entity matching, where no labelled examples are available for an unseen target dataset. Recently, large language models (LLMs) have shown promising results for zero-shot EM, but their low throughput and high deployment cost limit their applicability and scalability. We revisit the zero-shot EM problem with AnyMatch, a small language model fine-tuned in a transfer learning setup. We propose several novel data selection techniques to generate fine-tuning data for our model, e.g., by selecting difficult pairs to match via an AutoML filter, by generating additional attribute-level examples, and by controlling label imbalance in the data. We conduct an extensive evaluation of the prediction quality and deployment cost of our model, in comparison to thirteen baselines on nine benchmark datasets. We find that AnyMatch provides competitive prediction quality despite its small parameter size: it achieves the second-highest F1 score overall, and outperforms several other approaches that employ models with hundreds of billions of parameters. Furthermore, our approach exhibits major cost benefits: the average prediction quality of AnyMatch is within 4.4% of the state-of-the-art method MatchGPT with the proprietary trillion-parameter model GPT-4, yet AnyMatch requires four orders of magnitude fewer parameters and incurs a 3,899 times lower inference cost (in dollars per 1,000 tokens).
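
The abstract mentions "controlling label imbalance in the data" without giving the procedure. One common way to do this is to downsample non-matching pairs until they stay below a fixed ratio to matching pairs. The sketch below illustrates that generic technique only: the function name `cap_negative_ratio`, the `max_neg_per_pos` parameter, and the tuple layout are hypothetical, and the paper's actual rule may differ.

```python
import random
from typing import List, Tuple

# (left text, right text, label), where label 1 = match and 0 = non-match
Pair = Tuple[str, str, int]


def cap_negative_ratio(
    pairs: List[Pair], max_neg_per_pos: float, seed: int = 0
) -> List[Pair]:
    """Downsample non-matches to at most `max_neg_per_pos` per match.

    Hypothetical illustration of label-imbalance control; not the
    paper's exact procedure.
    """
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    limit = int(max_neg_per_pos * len(positives))
    if len(negatives) > limit:
        rng = random.Random(seed)  # fixed seed keeps the sample reproducible
        negatives = rng.sample(negatives, limit)
    return positives + negatives


if __name__ == "__main__":
    data = [("a", "a", 1), ("b", "b", 1)] + [(str(i), "x", 0) for i in range(10)]
    balanced = cap_negative_ratio(data, max_neg_per_pos=2.0)
    print(len(balanced))
```

Entity matching datasets typically contain far more non-matches than matches, so a cap of this kind keeps the fine-tuning data from being dominated by easy negatives.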

Paper Structure

This paper contains 19 sections, 3 figures, 5 tables, 3 algorithms.

Figures (3)

  • Figure 1: AnyMatch offers competitive zero-shot entity matching performance at a low cost. Its average F1 score is only 4.4% lower than the score of the state-of-the-art method MatchGPT with OpenAI's GPT-4 model, which requires a trillion-parameter LLM at a 3,899x higher cost per 1,000 tokens.
  • Figure 2: High-level overview of AnyMatch -- (1) We generate fine-tuning data from the available labelled datasets by applying several data selection techniques, and (2) fine-tune a language model as zero-shot entity matcher. (3) We use the resulting matcher for inference on unseen target data at deployment time.
  • Figure 3: AnyMatch outperforms the majority of models applied by MatchGPT, even though they have up to three orders of magnitude more parameters (a), and offers an attractive trade-off at a 3,899x better price than MatchGPT using GPT-4 with only a 4.4% decrease in prediction quality (b).