
Prompt-Matcher: Leveraging Large Models to Reduce Uncertainty in Schema Matching Results

Longyu Feng, Huahang Li, Chen Jason Zhang

TL;DR

Prompt-Matcher tackles uncertainty in schema matching by integrating an NP-hard, budget-constrained correspondence selection with LLM-based verification and probabilistic updates. The authors formalize the Correspondence Selection Problem, prove its NP-hardness, and deploy a $(1-1/e)$-approximation greedy algorithm, paired with two targeted prompts (Semantic-Match and Abbreviation-Match) for verification. Empirical results on DeepMDataset and Fabricated-Dataset show state-of-the-art verification performance and faster uncertainty reduction, with greedy selection often ranking the best schema match early under realistic budgets. The approach enables more reliable, cost-aware schema matching in data integration tasks and suggests avenues for enhancing efficiency and LLM self-consistency in future work.
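
For context, the $(1-1/e)$ factor matches the classical guarantee for greedily maximizing a monotone submodular set function under a cardinality constraint. The statement below is a sketch under that assumption; the symbols $f$, $\mathcal{T}_{\mathrm{greedy}}$, $\mathcal{T}^{*}$, and $B$ are introduced here for illustration, and whether the paper's budget model is exactly this setting is not confirmed by the summary above.

$$
f(\mathcal{T}_{\mathrm{greedy}}) \;\ge\; \left(1 - \frac{1}{e}\right) f(\mathcal{T}^{*}) \;\approx\; 0.632\, f(\mathcal{T}^{*}), \qquad |\mathcal{T}_{\mathrm{greedy}}| \le B,\ |\mathcal{T}^{*}| \le B
$$

Here $f(\mathcal{T})$ stands for the expected uncertainty reduction from verifying the correspondence set $\mathcal{T}$, $\mathcal{T}^{*}$ is an optimal set within budget $B$, and $\mathcal{T}_{\mathrm{greedy}}$ is the set built by repeatedly adding the correspondence with the largest marginal gain.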

Abstract

Schema matching is the process of identifying correspondences between the elements of two given schemata, and it is essential for database management systems, data integration, and data warehousing. The optimal schema matching algorithm differs across datasets and scenarios. Even for a single algorithm, hyperparameter tuning causes multiple results. All results are assigned equal probabilities and stored in probabilistic databases to facilitate uncertainty management. This substantial degree of uncertainty diminishes the efficiency and reliability of data processing, which prevents decision-makers from receiving more accurate information. To address this problem, we introduce a new approach based on fine-grained correspondence verification with specific prompts for a Large Language Model. Our approach is an iterative loop that consists of three main components: (1) the correspondence selection algorithm, (2) correspondence verification, and (3) the update of the probability distribution. The core idea is that correspondences intersect across multiple results, thereby linking the verification of correspondences to the reduction of uncertainty in candidate results. The task of selecting an optimal correspondence set to maximize the anticipated uncertainty reduction within a fixed budget is established as an NP-hard problem. We propose a novel $(1-1/e)$-approximation algorithm that significantly outperforms the brute-force algorithm in terms of computational efficiency. To enhance correspondence verification, we have developed two prompt templates that enable GPT-4 to achieve state-of-the-art performance across two established benchmark datasets. Our comprehensive experimental evaluation demonstrates the superior effectiveness and robustness of the proposed approach.
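
The three-component loop described in the abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes Shannon entropy as the uncertainty measure, unit cost per correspondence, and a single fixed `accuracy` value standing in for the LLM's reported confidence. All function names, and the `verify` callable that stands in for the LLM call, are hypothetical.

```python
import itertools
import math

def entropy(probs):
    """Shannon entropy (bits) of a distribution over candidate results."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def likelihood(result, corr, answer, accuracy):
    """P(answer | result is correct), assuming a symmetric verifier error."""
    agrees = (corr in result) == answer
    return accuracy if agrees else 1.0 - accuracy

def update(results, probs, corr, answer, accuracy):
    """Bayesian update of result probabilities after one verification."""
    weights = [p * likelihood(r, corr, answer, accuracy)
               for r, p in zip(results, probs)]
    total = sum(weights)
    if total == 0:  # answer contradicts every result; keep the prior
        return list(probs)
    return [w / total for w in weights]

def expected_entropy(results, probs, selected, accuracy):
    """Expected posterior entropy over all answer vectors for `selected`."""
    exp_h = 0.0
    for answers in itertools.product([True, False], repeat=len(selected)):
        joint = []
        for r, p in zip(results, probs):
            w = p
            for c, a in zip(selected, answers):
                w *= likelihood(r, c, a, accuracy)
            joint.append(w)
        p_answers = sum(joint)
        if p_answers > 0:
            exp_h += p_answers * entropy([w / p_answers for w in joint])
    return exp_h

def greedy_select(results, probs, budget, accuracy):
    """Greedily add the correspondence that minimizes expected entropy."""
    candidates = set().union(*results)
    selected = []
    for _ in range(min(budget, len(candidates))):
        remaining = sorted(candidates - set(selected))
        best = min(remaining, key=lambda c: expected_entropy(
            results, probs, selected + [c], accuracy))
        selected.append(best)
    return selected

def prompt_matcher_loop(results, probs, total_budget, k, verify, accuracy):
    """Split the budget into k shares; select, verify, update per share."""
    share = total_budget // k
    for _ in range(k):
        for corr in greedy_select(results, probs, share, accuracy):
            probs = update(results, probs, corr, verify(corr), accuracy)
    return probs

# Four hypothetical candidate results with equal prior probability.
results = [{"A", "B"}, {"A", "C"}, {"D"}, {"E"}]
probs = [0.25] * 4
truth = {"A", "B"}  # pretend the first result is the correct match
final = prompt_matcher_loop(results, probs, 4, 2, lambda c: c in truth, 0.9)
print(final)
```

Because correspondence `"A"` appears in exactly half of the candidate results, verifying it splits the probability mass most evenly, which is why a greedy step favors it. Note that `expected_entropy` enumerates every answer vector, so its cost grows exponentially with the size of one budget share; the sketch is only practical for small shares.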

Paper Structure

This paper contains 20 sections, 2 theorems, 18 equations, 9 figures, 8 tables.

Key Result

Lemma 1

Objective Function. For a view $V$, let $\mathcal{T}$ be the selected correspondence set and $AS^{\mathcal{T}}$ the corresponding answer family. The objective function is

Figures (9)

  • Figure 1: Prompt-Matcher. The total budget is split into k budget shares. At each iteration, the correspondence selection algorithm tries to select the correspondence subset that maximizes the expected uncertainty reduction. Then, the LLM returns its answers for correspondence verification together with their confidences. Finally, the probabilities of the candidate results are updated with the answers and the confidences.
  • Figure 2: Two prompt templates suited to schema matching tasks. The blue placeholders are filled with the attribute names and their values in the correspondences. schema_name is filled with the dataset name or domain information.
  • Figure 3: Candidate result set. The procedural workflow for generating the candidate result set.
  • Figure 4: Assays Experiment Result: MRR
  • Figure 5: Musician Experiment Result: MRR
  • ...and 4 more figures

Theorems & Definitions (11)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Definition 8
  • Lemma 1
  • Definition 9
  • ...and 1 more