Contextual Graph Embeddings: Accounting for Data Characteristics in Heterogeneous Data Integration

Yuka Haruki, Shigeru Ishikura, Kazuya Demachi, Teruaki Hayashi

TL;DR

This work tackles robust data integration, specifically schema matching (SM) and entity resolution (ER) in heterogeneous datasets, by introducing contextual graph embeddings that fuse tabular structure with column descriptions and external knowledge. The proposed 4-partite graph framework extends structural graphs with schema- and instance-level similarities, token merging via FastText, and weighted random walks, yielding 300-dimensional embeddings used for SM and ER. Across two experiments, the method consistently outperforms a baseline graph approach and a GPT-5 LLM, especially on datasets with high numerical content, missing values, and limited overlap. The authors also identify failure cases in which lexically similar but semantically distinct columns are confused. The findings highlight the importance of dataset-aware design and suggest semi-automated, human-in-the-loop validation for practical enterprise deployment under real-world conditions.

Abstract

As organizations continue to access diverse datasets, the demand for effective data integration has increased. Key tasks in this process, such as schema matching and entity resolution, are essential but often require significant effort. Although previous studies have aimed to automate these tasks, the influence of dataset characteristics on matching effectiveness has not been thoroughly examined, and combinations of different methods remain limited. This study introduces a contextual graph embedding technique that integrates structural details from tabular data and contextual elements such as column descriptions and external knowledge. Tests conducted on datasets with varying properties, such as domain specificity, data size, missing rate, and overlap rate, showed that our approach consistently surpassed existing graph-based methods, especially in difficult scenarios such as those with a high proportion of numerical values or significant missing data. However, we identified specific failure cases, such as columns that were semantically similar but distinct, which remain a challenge for our method. The study highlights two main insights: (i) contextual embeddings enhance matching reliability, and (ii) dataset characteristics significantly affect integration outcomes. These contributions can advance the development of practical data integration systems that support real-world enterprise applications.

Paper Structure

This paper contains 20 sections, 16 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Architecture of the proposed contextual 4-partite graph embedding framework. Steps 1 & 2: The model extends the tripartite graph—with RIDs, TOKs, and column CIDs—into a 4-partite graph by introducing weighted edges between the CIDs based on schema- and instance-level similarity. Step 3: The method learns enriched embeddings that capture both structural and contextual information through token merging and weighted random walks guided by column importance. Step 4: These embeddings are then used for SM and ER, improving robustness across heterogeneous datasets with varied dataset properties.
  • Figure 2: SM performance under varying missing rates.
  • Figure 3: ER performance under varying missing rates.
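
The graph construction and walk generation described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the node-name prefixes, the `build_graph` and `weighted_walk` helpers, the whitespace tokenizer, the toy tables, and the hand-set CID similarity weight are all hypothetical. Sampling neighbors in proportion to edge weight is only one plausible reading of "weighted random walks". Token merging via FastText and the embedding training step are omitted.

```python
import random
from collections import defaultdict


def build_graph(tables, cid_similarity):
    """Build a weighted, undirected adjacency map over RID, TOK, and CID nodes.

    tables: {table_name: [row_dict, ...]}
    cid_similarity: {(column_a, column_b): weight}, assumed to be precomputed
        from schema- and instance-level similarity (hypothetical input).
    """
    graph = defaultdict(dict)

    def add_edge(u, v, weight=1.0):
        graph[u][v] = weight
        graph[v][u] = weight

    for name, rows in tables.items():
        for i, row in enumerate(rows):
            rid = f"rid__{name}_{i}"
            for column, value in row.items():
                if value is None:
                    continue  # a missing cell contributes no edges
                cid = f"cid__{name}.{column}"
                # Assumed tokenizer: lowercase, split on whitespace.
                for tok in str(value).lower().split():
                    tok_node = f"tok__{tok}"
                    add_edge(rid, tok_node)
                    add_edge(cid, tok_node)

    # Extra weighted edges between column nodes (the extension in Steps 1 & 2).
    for (col_a, col_b), weight in cid_similarity.items():
        add_edge(f"cid__{col_a}", f"cid__{col_b}", weight)
    return graph


def weighted_walk(graph, start, length, rng):
    """Random walk that picks each next node in proportion to edge weight."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = graph.get(walk[-1], {})
        if not neighbors:
            break
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


# Toy data: two tables describing the same entity with different schemas.
tables = {
    "a": [{"title": "Blade Runner", "year": 1982}],
    "b": [{"name": "blade runner", "released": None}],
}
cid_similarity = {("a.title", "b.name"): 2.0}  # hand-set for illustration

graph = build_graph(tables, cid_similarity)
rng = random.Random(0)
walks = [weighted_walk(graph, node, 5, rng) for node in sorted(graph)]
```

Each walk is a sequence of node names. In a pipeline of this kind, such sequences would typically be passed to a sequence-embedding model to obtain node vectors, which are then compared for SM (CID nodes) and ER (RID nodes).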