BEACON: Budget-Aware Entity Matching Across Domains (Extended Technical Report)

Nicholas Pulsone, Roee Shraga, Gregory Goren

Abstract

Entity Matching (EM)--the task of determining whether two data records refer to the same real-world entity--is a core task in data integration. Recent advances in deep learning have set a new standard for EM, particularly through fine-tuning Pretrained Language Models (PLMs) and, more recently, Large Language Models (LLMs). However, fine-tuning typically requires large amounts of labeled data, which are expensive and time-consuming to obtain. In e-commerce matching, label scarcity varies widely across domains, raising the question of how to train accurate domain-specific EM models with limited labeled data. In this work, we assume users have only a limited number of labels for a specific target domain but have access to labeled data from other domains. We introduce BEACON, a distribution-aware, budget-aware framework for low-resource EM across domains. BEACON leverages the insight that embedding representations of pairwise candidate matches can guide the effective selection of out-of-domain samples under limited in-domain supervision. We conduct extensive experiments across multiple domain-partitioned datasets derived from established EM benchmarks, demonstrating that BEACON consistently outperforms state-of-the-art methods under different training budgets.

Paper Structure (30 sections, 17 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Performing EM between two e-commerce datasets. Each sample compares two product records and is derived from the Web Data Commons Multi-Dimensional Entity Matching Benchmark (WDC) [peeters2023wdc].
  • Figure 2: Entity matching samples from Figure 1 grouped by product category domain. The goal is to select a pairwise sample to complement the selected training set for EM in the "Computers" product category.
  • Figure 3: Effect of Oversampling on PLM-based Entity Matching. We fine-tune RoBERTa on two small product domains--Automotive and Clothing--from the WDC dataset using progressively larger training budgets that permit oversampling.
  • Figure 4: Illustration of the Train–Validation Distribution Fitting (TVDF) sampling procedure. (a) Validation set embeddings with centroid $\mu_{\text{val}}$ (orange star). (b) In-domain training embeddings for domain $i \in \mathcal{D}$ with centroid $\mu_i$ (blue star) and overlaid $\mu_{\text{val}}$. (c) Out-of-domain samples are ranked by the TVDF selector based on their contribution to aligning $\mu_i$ with $\mu_{\text{val}}$. (d) Selected samples are added to the training set, resulting in closer alignment between the training and validation centroids.
  • Figure 5: Illustration of the K-Center Greedy (KCG) sampling procedure. (a) In-domain embeddings $n_i$ serve as the initial centers. (b) Out-of-domain samples are ranked by the KCG selector based on their distance to the nearest in-domain center. (c) Selected samples are added to the training set to fill gaps in the embedding space, increasing overall coverage.
  • ...and 3 more figures
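
The two selection procedures described in the captions of Figures 4 and 5 can be illustrated with a small sketch. This is not the authors' implementation: the captions only state what each selector ranks by, so the function names `kcg_select` and `tvdf_select`, the use of Euclidean distance, and the greedy one-sample-at-a-time updates are all assumptions made for illustration. Embeddings are assumed to be rows of NumPy arrays.

```python
import numpy as np


def kcg_select(in_domain, out_domain, budget):
    """Sketch of K-Center Greedy: in-domain embeddings are the initial centers.

    Repeatedly picks the out-of-domain embedding farthest from its nearest
    center, then treats it as a new center (standard k-center greedy update;
    assumed here, not confirmed by the captions).
    """
    # Distance from every candidate to its nearest in-domain center.
    pairwise = np.linalg.norm(
        out_domain[:, None, :] - in_domain[None, :, :], axis=2
    )
    nearest = pairwise.min(axis=1)
    selected = []
    for _ in range(min(budget, len(out_domain))):
        idx = int(np.argmax(nearest))
        selected.append(idx)
        # The chosen point becomes a center; refresh nearest-center distances.
        to_new_center = np.linalg.norm(out_domain - out_domain[idx], axis=1)
        nearest = np.minimum(nearest, to_new_center)
        nearest[idx] = -np.inf  # never pick the same candidate twice
    return selected


def tvdf_select(in_domain, out_domain, val, budget):
    """Sketch of train-validation distribution fitting via centroid alignment.

    Greedily adds the out-of-domain embedding that moves the training
    centroid closest to the validation centroid (hypothetical scoring rule).
    """
    mu_val = val.mean(axis=0)
    running_sum = in_domain.sum(axis=0)
    count = len(in_domain)
    remaining = list(range(len(out_domain)))
    selected = []
    for _ in range(min(budget, len(out_domain))):
        # Training centroid that would result from adding each candidate.
        candidate_mu = (running_sum + out_domain[remaining]) / (count + 1)
        gaps = np.linalg.norm(candidate_mu - mu_val, axis=1)
        idx = remaining.pop(int(np.argmin(gaps)))
        selected.append(idx)
        running_sum = running_sum + out_domain[idx]
        count += 1
    return selected
```

Both functions return indices into `out_domain` in selection order, so a training budget can be applied by truncating the list. The coverage-oriented selector favors candidates in regions the in-domain data does not reach, while the centroid-oriented selector favors candidates that pull the training mean toward the validation mean.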

Theorems & Definitions (4)

  • Example 1.1
  • Example 1.2
  • Example 3.1
  • Definition 1: Budget-Aware Entity Matching Across Domains (EMAD)