KcMF: A Knowledge-compliant Framework for Schema and Entity Matching with Fine-tuning-free LLMs

Yongqin Xu, Huan Li, Ke Chen, Lidan Shou

TL;DR

KcMF tackles schema matching and entity matching with a tuning-free, knowledge-enhanced framework that uses a once-and-for-all pseudo-code-based reasoning design, augmented by external knowledge from Dataset as Knowledge (DaK) and Example as Knowledge (EaK). It introduces Self-Indicator Extraction and Summarized Demonstrations to improve prompt quality, and an Inconsistency-tolerant Generation Ensemble (IntGE) to robustly combine outputs from multiple knowledge sources. Empirical results on SM and EM show consistent improvements across five LLM backbones and competitive performance versus fine-tuned baselines, with strong generalization to out-of-domain data. The work offers a practical pathway for deploying LLM-based data matching without fine-tuning, particularly in privacy-sensitive or cross-domain settings.

Abstract

Schema matching (SM) and entity matching (EM) tasks are crucial for data integration. While large language models (LLMs) have shown promising results in these tasks, they suffer from hallucinations and confusion about task instructions. This study presents the Knowledge-Compliant Matching Framework (KcMF), an LLM-based approach that addresses these issues without the need for domain-specific fine-tuning. KcMF employs a once-and-for-all pseudo-code-based task decomposition strategy to adopt natural language statements that guide LLM reasoning and reduce confusion across various task types. We also propose two mechanisms, Dataset as Knowledge (DaK) and Example as Knowledge (EaK), to build domain knowledge sets when unstructured domain knowledge is lacking. Moreover, we introduce a result-ensemble strategy to leverage multiple knowledge sources and suppress badly formatted outputs. Extensive evaluations confirm that KcMF clearly enhances five LLM backbones in both SM and EM tasks while outperforming the non-LLM competitors by an average F1-score of 17.93%.
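The result-ensemble idea mentioned in the abstract (combining answers obtained with different knowledge sources while suppressing badly formatted outputs) can be illustrated with a minimal sketch. This is not the paper's IntGE procedure: it assumes a plain majority vote, a hypothetical yes/no answer format, and a hypothetical default label for ties and all-malformed cases.

```python
from collections import Counter
from typing import List, Optional

# Assumption: each LLM generation is expected to be a bare "yes" or "no".
VALID_LABELS = {"yes", "no"}


def parse_answer(raw: str) -> Optional[str]:
    """Return 'yes' or 'no' if the output is well formed, otherwise None."""
    text = raw.strip().lower().rstrip(".")
    return text if text in VALID_LABELS else None


def ensemble(outputs: List[str], default: str = "no") -> str:
    """Majority vote over well-formed outputs; malformed outputs are ignored.

    Falls back to `default` (a hypothetical choice) when no output is
    well formed or when the vote is tied.
    """
    votes = Counter(a for a in map(parse_answer, outputs) if a is not None)
    if not votes:
        return default
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]
```

For example, `ensemble(["Yes", "no", "yes.", "I think maybe"])` discards the last generation as malformed and returns the majority label among the remaining three.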

Paper Structure

This paper contains 55 sections, 3 equations, 12 figures, 27 tables, 2 algorithms.

Figures (12)

  • Figure 1: Three common issues in LLM-based data matching tasks and an overview of the enhancements discussed in this study. As shown at the bottom, given pseudo-code and retrieved knowledge, the LLM works through the pseudo-code statements and is able to reject the match between patient-id and drug-id.
  • Figure 2: Overview of KcMF. Our carefully designed pseudo-code (detailed in Appendix \ref{appendix-section:implementation}) offers a reusable and efficient solution for both SM and EM tasks. This eliminates the need for redesigning statements from scratch when working with new datasets, streamlining the process and enhancing adaptability.
  • Figure 3: A toy example of DaK. The object "provider" and its description are identified (① & ②); they are then integrated into an entry that forms a piece of DaK knowledge (③).
  • Figure 4: An example of EaK.
  • Figure 5: A toy prompt combining all outcomes from previous sections and placeholders $\{\mathcal{C}_\text{RSNG}\}$ and $\{ans\}$.
  • ...and 7 more figures