Knowledge Graph-based Retrieval-Augmented Generation for Schema Matching

Chuangtao Ma, Sriom Chakrabarti, Arijit Khan, Bálint Molnár

TL;DR

This work tackles semantic ambiguities and LLM hallucinations in schema matching by introducing KG-RAG4SM, a knowledge-graph-based retrieval-augmented generation framework. It retrieves relevant subgraphs from large KGs through vector-based, traversal-based, and query-based methods, ranks and prunes them, and uses them to augment LLM prompts without re-training. Across healthcare-focused datasets, KG-RAG4SM consistently outperforms state-of-the-art LLM- and PLM-based methods in Precision and F1, while also demonstrating efficiency and scalability on large knowledge graphs. A case study on real-world EMED data confirms reduced hallucinations when external KG context informs the matching decisions, underscoring practical impact for data integration in healthcare. The approach highlights the potential of external knowledge to enhance LLM-based data integration tasks beyond schema matching, with pathways for broader domain KG deployment.

Abstract

Traditional similarity-based schema matching methods cannot resolve semantic ambiguities and conflicts in complex, domain-specific mapping scenarios because they lack commonsense and domain-specific knowledge. The hallucination problem of large language models (LLMs) also makes it difficult for LLM-based schema matching to address these issues. We therefore propose a Knowledge Graph-based Retrieval-Augmented Generation model for Schema Matching, referred to as KG-RAG4SM. In particular, KG-RAG4SM introduces novel vector-based, graph traversal-based, and query-based graph retrievals, as well as a hybrid approach and ranking schemes that identify the most relevant subgraphs from large external knowledge graphs (KGs). We show that KG-based retrieval-augmented LLMs can generate more accurate results for complex matching cases without any re-training. Our experimental results show that KG-RAG4SM outperforms the LLM-based state-of-the-art (SOTA) methods (e.g., Jellyfish-8B) by 35.89% and 30.50% in precision and F1 score, respectively, on the MIMIC dataset; KG-RAG4SM with GPT-4o-mini outperforms the pre-trained language model (PLM)-based SOTA methods (e.g., SMAT) by 69.20% and 21.97% in precision and F1 score, respectively, on the Synthea dataset. The results also demonstrate that our approach is more efficient in end-to-end schema matching and scales to retrieval from large KGs. Our case studies on a dataset from a real-world schema matching scenario show that our solution effectively mitigates the hallucination problem of LLMs in schema matching.
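
The abstract reports its gains in precision and F1 score. As a reminder of what those metrics measure for schema matching, here is a minimal sketch that scores a set of predicted attribute correspondences against a gold standard. The function name, the attribute names, and the toy data are hypothetical and chosen only for illustration; the paper's own evaluation protocol is not described in this excerpt and may differ.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted attribute matches against gold matches.

    Both arguments are iterables of (source_attribute, target_attribute) pairs.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy data (made up for illustration, not taken from MIMIC or Synthea).
gold_matches = {
    ("admit_time", "visit_start_date"),
    ("disch_time", "visit_end_date"),
    ("subject_id", "person_id"),
}
predicted_matches = {
    ("admit_time", "visit_start_date"),
    ("subject_id", "person_id"),
    ("admit_time", "birth_date"),    # false positive
    ("disch_time", "death_date"),    # false positive
}

precision, recall, f1 = precision_recall_f1(predicted_matches, gold_matches)
```

In this toy case two of the four predicted pairs are correct, so precision is lower than recall. A matcher that proposes plausible but wrong pairs is penalized in precision, which is the metric most directly affected by hallucinated matches.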

Paper Structure

This paper contains 32 sections, 5 equations, 10 figures, 11 tables, 1 algorithm.

Figures (10)

  • Figure 1: Example of schema matching in the EHR data model. Table (a) and Table (b) are from the source schema, and Table (c) is from the target schema. The textual description for attributes originates from the database design documents. The solid arrows indicate the corresponding mappings that are marked with the same color.
  • Figure 2: The Role of KG context in augmenting LLMs for schema matching.
  • Figure 3: Overview of our proposed KG-RAG4SM method. Given schema matching questions and an external knowledge graph, we (1) retrieve the relevant KG triplets based on vector similarity between question embeddings and $\mathcal{KG}$ triplet embeddings; (2) prune the retrieved KG triplets with vector similarity-based ranking; (3) augment prompts with the retrieved and refined subgraphs from the large $\mathcal{KG}$ and generate the final answer for the given schema matching questions.
  • Figure 4: Prompts for LLMs as schema matcher.
  • Figure 5: Prompts for LLMs as entity retriever.
  • ...and 5 more figures
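
The three steps in the Figure 3 caption (retrieve by vector similarity, prune by ranking, augment the prompt) can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the bag-of-words `embed` function stands in for a real text-embedding model, the triples are invented rather than drawn from an actual KG, and every function name, threshold, and the prompt wording are hypothetical. The call to the LLM itself is omitted.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, triples, top_k=3):
    """Step 1: score every triple against the question and keep the top_k."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(" ".join(t))), t) for t in triples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def prune(scored, min_score=0.1):
    """Step 2: drop retrieved triples whose similarity falls below a threshold."""
    return [triple for score, triple in scored if score >= min_score]


def build_prompt(question, triples):
    """Step 3: prepend the surviving triples to the matching question."""
    context = "\n".join(f"({h}, {r}, {t})" for h, r, t in triples)
    return (
        "Knowledge graph context:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer yes or no."
    )


# Invented triples for illustration only.
TOY_KG = [
    ("admission time", "is a", "point in time"),
    ("visit start date", "is a", "point in time"),
    ("blood pressure", "measured in", "mmHg"),
]
question = "Does attribute admission time match attribute visit start date?"

kept = prune(retrieve(question, TOY_KG))
prompt = build_prompt(question, kept)
```

Here the blood-pressure triple shares no tokens with the question, so it scores zero and is pruned before the prompt is built. A real system would replace the toy vectors with dense embeddings and send `prompt` to an LLM, but the control flow would keep the same retrieve, prune, augment shape.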