ReMatch: Retrieval Enhanced Schema Matching with LLMs

Eitam Sheetrit; Menachem Brief; Moshik Mishaeli; Oren Elisha

TL;DR

This work tackles semantic schema matching under privacy constraints by proposing ReMatch, a retrieval-enhanced LLM approach that does not require predefined mappings or model training. It converts schemas into document corpora, uses embeddings to retrieve candidate target tables, and applies a generative LLM to rank matches, producing an $N\times K$ mapping for $N=|\mathcal{A}_1|$. Evaluations on large healthcare datasets (MIMIC-III to OMOP and Synthea to OMOP) show that retrieval and guided prompting improve accuracy@$K$, with ReMatch outperforming prior methods like SMAT under realistic training splits. The method demonstrates practical potential for real-world data integration by offering scalable, privacy-preserving schema matching and by providing a new, sizable healthcare mapping dataset for benchmarking.
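The accuracy@$K$ metric mentioned above can be illustrated with a minimal sketch. It assumes the common definition (a source attribute counts as a hit when its gold target appears among the top $K$ ranked candidates); the function name, the data layout, and the example attribute names are illustrative assumptions, not taken from the paper or its dataset.

```python
from typing import Dict, List


def accuracy_at_k(
    predictions: Dict[str, List[str]],
    gold: Dict[str, str],
    k: int,
) -> float:
    """Fraction of source attributes whose gold target is in the top-k candidates.

    predictions maps each source attribute to its ranked candidate targets.
    gold maps each source attribute to its single correct target.
    This is an assumed, conventional definition, not the paper's exact code.
    """
    if not gold:
        return 0.0
    hits = 0
    for source_attr, target in gold.items():
        top_k = predictions.get(source_attr, [])[:k]
        if target in top_k:
            hits += 1
    return hits / len(gold)


# Illustrative (hypothetical) ranked candidates for two source attributes.
predictions = {
    "patients.dob": ["person.birth_datetime", "person.year_of_birth"],
    "admissions.admittime": [
        "visit_occurrence.visit_end_date",
        "visit_occurrence.visit_start_datetime",
    ],
}
gold = {
    "patients.dob": "person.birth_datetime",
    "admissions.admittime": "visit_occurrence.visit_start_datetime",
}

acc_at_1 = accuracy_at_k(predictions, gold, k=1)
acc_at_2 = accuracy_at_k(predictions, gold, k=2)
```

In this toy example only the first attribute has its gold target ranked first, while both gold targets appear within the top two candidates, so the score rises as $K$ grows.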

Abstract

Schema matching is a crucial task in data integration, involving the alignment of a source schema with a target schema to establish correspondence between their elements. This task is challenging due to textual and semantic heterogeneity, as well as differences in schema sizes. Although machine-learning-based solutions have been explored in numerous studies, they often suffer from low accuracy, require manual mapping of the schemas for model training, or need access to source schema data which might be unavailable due to privacy concerns. In this paper we present a novel method, named ReMatch, for matching schemas using retrieval-enhanced Large Language Models (LLMs). Our method avoids the need for predefined mapping, any model training, or access to data in the source database. Our experimental results on large real-world schemas demonstrate that ReMatch is an effective matcher. By eliminating the requirement for training data, ReMatch becomes a viable solution for real-world scenarios.
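The retrieval step (schemas turned into documents, then candidate target tables retrieved by embedding similarity) can be sketched as follows. This is a minimal illustration under stated assumptions: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the document template, function names, and table and column names are all hypothetical. Only schema metadata is used, never rows from the source database.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def table_to_document(name: str, columns: List[str]) -> str:
    """Serialize a table's metadata into one text document (hypothetical template)."""
    return f"table {name} columns " + " ".join(columns)


def retrieve_tables(source_doc: str, target_docs: Dict[str, str], top_j: int) -> List[str]:
    """Return the names of the top_j target tables most similar to the source document."""
    query = embed(source_doc)
    ranked = sorted(
        target_docs,
        key=lambda name: cosine(query, embed(target_docs[name])),
        reverse=True,
    )
    return ranked[:top_j]


# Hypothetical target schema with two tables.
targets = {
    "person": table_to_document("person", ["person_id", "birth_datetime", "gender"]),
    "visit": table_to_document("visit", ["visit_id", "visit_start", "visit_end"]),
}
# Hypothetical source table to be matched.
source = table_to_document("patients", ["subject_id", "gender", "birth_datetime"])

candidates = retrieve_tables(source, targets, top_j=1)
```

In a full pipeline the retrieved candidate tables would then be placed in a prompt so that a generative LLM can rank target attributes for each source attribute; that ranking step is omitted here.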

Paper Structure (8 sections, 1 equation, 2 figures, 6 tables, 1 algorithm)

Introduction
Background and Related Work
ReMatch
Evaluation
Dataset Creation
Experiments
Results
Conclusions

Figures (2)

Figure 1: Overview of the ReMatch method.
Figure 2: Comparison of the different models' performance. SMAT was trained and evaluated on 20% and 80% of the data, respectively, after removing all null mappings. ReMatch was evaluated on the entire dataset, with no guidance and with null mappings included. For ReMatch, the optimal setup found by grid search is shown.
