Schema Matching on Graph: Iterative Graph Exploration for Efficient and Explainable Data Integration

Mingyu Jeon, Jaeyoung Suh, Suwan Cho

TL;DR

This work reexamines KG-based schema matching in the medical domain by reviving a query-based approach. It introduces SMoG, which iteratively explores a knowledge graph using 1-hop SPARQL queries guided by an LLM, producing human-verifiable reasoning paths and avoiding heavy vector indices. On CMS data with Wikidata, SMoG achieves competitive F1 scores relative to state-of-the-art baselines while offering transparency and storage efficiency. The study analyzes reasoning depth and topic extraction strategies, revealing that semantic similarity is a more reliable signal than lexical overlap in medical terminology, and highlights future directions in adaptive reasoning control.

Abstract

Schema matching is a critical task in data integration, particularly in the medical domain, where disparate Electronic Health Record (EHR) systems must be aligned to standard models such as OMOP CDM. While Large Language Models (LLMs) have shown promise in schema matching, they suffer from hallucination and lack up-to-date domain knowledge. Knowledge Graphs (KGs) offer a solution by providing structured, verifiable knowledge. However, existing KG-augmented LLM approaches often rely on inefficient, complex multi-hop queries or on storage-intensive vector-based retrieval. This paper introduces SMoG (Schema Matching on Graph), a novel framework that iteratively executes simple 1-hop SPARQL queries, inspired by successful strategies in Knowledge Graph Question Answering (KGQA). SMoG improves explainability and reliability by generating human-verifiable query paths, and it significantly reduces storage requirements by querying SPARQL endpoints directly. Experimental results on real-world medical datasets show that SMoG achieves performance comparable to state-of-the-art baselines, validating its effectiveness and efficiency in KG-augmented schema matching.

Paper Structure

This paper contains 43 sections, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Overview of the SMoG Framework. The framework consists of two main phases: Topic Entity Extraction (TEE) and Graph Exploration (GE). TEE identifies the starting entity for a given attribute, while GE iteratively explores the Knowledge Graph using 1-hop SPARQL queries to find the optimal matching path.
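
The Graph Exploration phase described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the toy triples, the function names (`one_hop_sparql`, `run_one_hop`, `pick_by_token_overlap`, `explore`), and the stopping rule are all assumptions made for illustration. An in-memory triple list stands in for a SPARQL endpoint, and a simple token-overlap chooser stands in for the LLM that guides each step. The paper itself reports that semantic similarity is a more reliable signal than lexical overlap, so the chooser here is only a placeholder.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]
Edge = Tuple[str, str]  # (predicate, object)

# Hypothetical toy knowledge graph; these triples are NOT taken from Wikidata.
TOY_KG: List[Triple] = [
    ("patient", "has_attribute", "date_of_birth"),
    ("patient", "has_attribute", "gender"),
    ("date_of_birth", "instance_of", "point_in_time"),
]


def one_hop_sparql(entity: str) -> str:
    """Build the text of a 1-hop SPARQL query over outgoing edges of `entity`."""
    return f"SELECT ?p ?o WHERE {{ <{entity}> ?p ?o . }}"


def run_one_hop(entity: str, kg: List[Triple]) -> List[Edge]:
    """Stand-in for a SPARQL endpoint: return (predicate, object) pairs."""
    return [(p, o) for s, p, o in kg if s == entity]


def pick_by_token_overlap(edges: List[Edge], target: str) -> Edge:
    """Placeholder for the LLM: prefer the object sharing most tokens with target."""
    target_tokens = set(target.lower().split("_"))
    return max(edges, key=lambda e: len(set(e[1].lower().split("_")) & target_tokens))


def explore(
    start: str,
    target: str,
    kg: List[Triple],
    choose: Callable[[List[Edge], str], Edge],
    max_depth: int = 3,
) -> List[Triple]:
    """Follow one chosen edge per step and record the path for human review."""
    path: List[Triple] = []
    current = start
    for _ in range(max_depth):
        edges = run_one_hop(current, kg)
        if not edges:  # dead end: nothing more to explore
            break
        predicate, obj = choose(edges, target)
        path.append((current, predicate, obj))
        current = obj
    return path


if __name__ == "__main__":
    print(one_hop_sparql("patient"))
    for step in explore("patient", "birth_date", TOY_KG, pick_by_token_overlap, max_depth=1):
        print(step)
```

The returned path is a list of triples, so each hop can be inspected directly, which mirrors the human-verifiable reasoning paths the paper emphasizes. In a real setting, `run_one_hop` would send the query string to a SPARQL endpoint and `choose` would be an LLM call.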