APrompt4EM: Augmented Prompt Tuning for Generalized Entity Matching

Yikuan Xia, Jiazun Chen, Xinchi Li, Jun Gao

TL;DR

This work tackles Generalized Entity Matching (GEM) in low-resource settings by proposing APrompt4EM, an augmented prompt-tuning framework that combines natural-language prompts with a contextualized soft-token mechanism to better extract key information from diverse data formats. It also introduces an information augmentation component that uses LLMs to supplement missing or ambiguous attributes, gated by an uncertainty-based strategy that reduces token costs. Empirically, the basic APrompt4EM model improves F1 by about $5.24\%$ on average over strong baselines built on moderate-size PLMs, and the augmented model achieves performance comparable to fine-tuned LLMs at less than $14\%$ of the API cost, validated across twelve real-world GEM/EM datasets. Together, natural-language prompts, instance-specific soft tokens, and cost-aware information augmentation offer a practical, scalable approach to GEM on heterogeneous, noisy data.

Abstract

Generalized Entity Matching (GEM), which aims at judging whether two records represented in different formats refer to the same real-world entity, is an essential task in data management. The prompt tuning paradigm for pre-trained language models (PLMs), including the recent PromptEM model, effectively addresses the challenges of low-resource GEM in practical applications, offering a robust solution when labeled data is scarce. However, existing prompt tuning models for GEM face two challenges: prompt design and the information gap. This paper introduces an augmented prompt tuning framework that addresses both challenges through two main improvements. The first is an augmented, contextualized soft token-based prompt tuning method that extracts a guiding soft token to benefit the PLM's prompt tuning, and the second is a cost-effective information augmentation strategy leveraging large language models (LLMs). Our approach performs well on low-resource GEM challenges. Extensive experiments show that our basic model without information augmentation improves over existing methods based on moderate-size PLMs (+5.24% on average), and that our model with information augmentation achieves performance comparable to fine-tuned LLMs while using less than 14% of the API fee.

Paper Structure (13 sections, 12 equations, 4 figures, 3 tables)


Figures (4)

  • Figure 1: Different data schemas in GEM and unifying them using natural language texts. The entity presented by semi-structured and relational data is a typical RAM product in 2022, while the entity presented in text is a typical RAM product in 2014.
  • Figure 2: Illustration of our APrompt4EM framework.
  • Figure 3: Example of different data augmentation operators. The shuffle operator from Ditto doesn't provide additional information, while the augmentation from GPT 3.5 can supplement new information (red part), and extract information in structured forms from the description (blue part).