Matchmaker: Self-Improving Large Language Model Programs for Schema Matching

Nabeel Seedat, Mihaela van der Schaar

TL;DR

Matchmaker is a compositional language model program for schema matching, comprising candidate generation, refinement and confidence scoring. It outperforms previous ML-based approaches and has the potential to accelerate data integration and interoperability of ML-ready data.

Abstract

Schema matching -- the task of finding matches between attributes across disparate data sources with different tables and hierarchies -- is critical for creating interoperable machine learning (ML)-ready data. Addressing this fundamental data-centric problem has wide implications, especially in domains like healthcare, finance and e-commerce, and it can also benefit ML models more generally by increasing the data available for model training. However, schema matching is a challenging ML task due to structural/hierarchical and semantic heterogeneity between different schemas. Previous ML approaches to automating schema matching have either required significant labeled data for model training, which is often unrealistic, or suffered from poor zero-shot performance. To this end, we propose Matchmaker, a compositional language model program for schema matching, comprising candidate generation, refinement and confidence scoring. Matchmaker also self-improves in a zero-shot manner, without the need for labeled demonstrations, via a novel optimization approach that constructs synthetic in-context demonstrations to guide the language model's reasoning process. Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches, highlighting its potential to accelerate data integration and interoperability of ML-ready data.
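The three stages named in the abstract (candidate generation, refinement, confidence scoring) can be pictured with a minimal Python sketch. This is not the authors' implementation: the token-overlap similarity stands in for both retrieval and LLM reasoning, the self-improvement step is omitted, and every function name, threshold and attribute name below is a hypothetical chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


def tokens(name: str) -> set[str]:
    """Lower-case an attribute name and split it on underscores, dots and spaces."""
    for sep in ("_", "."):
        name = name.replace(sep, " ")
    return set(name.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity: a toy stand-in for retrieval and LLM judgement."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class Match:
    target: str
    confidence: float


def generate_candidates(source: str, targets: list[str], k: int = 3) -> list[str]:
    """Stage 1: shortlist the k target attributes most similar to the source."""
    return sorted(targets, key=lambda t: jaccard(source, t), reverse=True)[:k]


def refine(source: str, candidates: list[str], min_sim: float = 0.1) -> list[str]:
    """Stage 2: drop implausible candidates (assumed threshold, not from the paper)."""
    return [c for c in candidates if jaccard(source, c) >= min_sim]


def score(source: str, candidates: list[str]) -> list[Match]:
    """Stage 3: attach a confidence to each surviving candidate and rank them."""
    ranked = [Match(c, jaccard(source, c)) for c in candidates]
    return sorted(ranked, key=lambda m: m.confidence, reverse=True)


def match_attribute(source: str, targets: list[str], k: int = 3) -> list[Match]:
    """Compose the three stages; an empty list means no match was found."""
    return score(source, refine(source, generate_candidates(source, targets, k)))


# Hypothetical attribute names, written as "table.column".
TARGETS = [
    "person.birth_datetime",
    "person.gender_concept_id",
    "visit_occurrence.visit_start_date",
]

if __name__ == "__main__":
    for m in match_attribute("patients.birth_date", TARGETS):
        print(m.target, round(m.confidence, 3))
```

Returning an empty list when nothing survives refinement is one simple way to represent a source attribute that has no counterpart in the target schema.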

Paper Structure

This paper contains 42 sections, 2 equations, 17 figures, 8 tables, 2 algorithms.

Figures (17)

  • Figure 1: Example showing the complexity of schema matching due to its multi-faceted challenges. Database heterogeneity (green arrows): Identifying the correct target table is the first step; because each schema has a different number of tables, the corresponding information may be distributed differently across the tables of each schema. Structural heterogeneity (green arrows): Once the appropriate table is found, matching attributes is complicated by differences in schema architectures, hierarchies, and granularity. Textual heterogeneity (green arrows): Matching is ambiguous when attributes have the same names but different meanings, or different names with the same meaning. Information mismatch (red arrows): Some attributes in one schema may lack a corresponding match in the other schema, adding to the complexity of the matching process.
  • Figure 2: Example result showing that semantic similarity alone cannot solve schema matching: it yields low accuracy@k compared to Matchmaker.
  • Figure 3: Conceptual comparison of different schema matching approaches. (A) Supervised Matching [Zhang2021SMATAA] employs a trained neural network (e.g., a transformer) to predict binary match/no-match labels across all attribute pairs, scaling as $\mathcal{O}(n^2)$ and requiring labeled data, which makes it unsuitable for zero-shot use. (B) LLM-Prompting [Narayan2022CanFM, Zhang2023LargeDP] uses a frozen language model (e.g., GPT-4) for the same task, with similar scalability. Alternatively, [zhang2023jellyfish] fine-tunes the LLM, which requires labeled data. (C) RAG-Based [sheetrit2024rematch] improves scalability by retrieving candidates from a vector database and using a frozen LLM to select matches, but its effectiveness is limited to semantically similar options. (D) Matchmaker (Ours) performs schema matching via a self-improving, compositional language model program that enables enhanced reasoning. The program includes both retrieval and reasoning-based candidate generation with refinement and confidence scoring, allowing for better ranking. The program is optimized using synthetic in-context examples in the LLM prompts.
  • Figure 4: Examples of using Matchmaker in practice. (a) Deferring uncertain samples to humans via entropy deferral improves schema matching performance. (b) Performance gains are obtained when correcting errors which are semantically similar to the true attribute.
  • Figure 5: Illustration of the MIMIC-OMOP schema matching task showing the complexity and schema hierarchies.
  • ...and 12 more figures
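
The captions for Figures 2 and 4 mention accuracy@k and entropy deferral without spelling them out. The sketch below shows one common reading of each, under stated assumptions: accuracy@k is taken as the fraction of source attributes whose true target appears among the top k ranked candidates, and entropy deferral is taken as sending a query to a human when the Shannon entropy of its normalised confidence scores exceeds a threshold. The function names, the normalisation and the threshold value are assumptions, not details taken from the paper.

```python
import math


def accuracy_at_k(ranked: dict[str, list[str]], gold: dict[str, str], k: int) -> float:
    """Fraction of source attributes whose gold target is in the top-k candidates."""
    hits = sum(1 for src, cands in ranked.items() if gold[src] in cands[:k])
    return hits / len(ranked)


def entropy(confidences: list[float]) -> float:
    """Shannon entropy (in nats) of confidence scores normalised to sum to one."""
    total = sum(confidences)
    if total == 0:
        # No signal at all: treat the scores as a uniform distribution.
        return math.log(len(confidences)) if confidences else 0.0
    probs = [c / total for c in confidences if c > 0]
    return -sum(p * math.log(p) for p in probs)


def defer_to_human(confidences: list[float], threshold: float) -> bool:
    """Defer when the candidate scores are too evenly spread to trust the top one."""
    return entropy(confidences) > threshold


if __name__ == "__main__":
    # Hypothetical ranked outputs for two source attributes.
    ranked = {"a": ["x", "y"], "b": ["y", "z"]}
    gold = {"a": "y", "b": "x"}
    print(accuracy_at_k(ranked, gold, k=1))
    print(accuracy_at_k(ranked, gold, k=2))
    print(defer_to_human([0.5, 0.5], threshold=0.5))
    print(defer_to_human([0.9, 0.1], threshold=0.5))
```

A peaked score distribution has low entropy and is answered automatically, while a flat one has high entropy and is deferred, which is the intuition behind reserving human effort for uncertain matches.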

Theorems & Definitions (1)

  • Definition 1: Schema Matching