KGpipe: Generation and Evaluation of Pipelines for Data Integration into Knowledge Graphs

Marvin Hofer, Erhard Rahm

TL;DR

KGpipe presents an open-source framework to define, execute, and evaluate end-to-end data integration pipelines for knowledge graphs, enabling composition of task-specific tools and LLM components across heterogeneous inputs. The authors introduce a formal problem definition, source-type aware integration strategies, and a modular implementation with Python, Docker, and HTTP backends, anchored by two intermediate exchange formats $JSON_{ER}$ and $JSON_{KE}$. A movie-domain benchmark paired with a seed KG, multiple source formats, and a reference KG enables systematic evaluation over Statistical, Semantic, and Reference metrics, with an aggregation scheme to rank pipelines under configurable weights. Empirical results show RDF-based pipelines deliver stronger structural and semantic quality, while LLM-based components improve ontology matching and relation mapping at higher costs and runtimes; text-derived pipelines lag in recall and precision. The work provides a foundation for reproducible KG construction and paves the way for data-driven pipeline design and automatic recommendations conditioned on ontology, seed graph, and source descriptors.

Abstract

Building high-quality knowledge graphs (KGs) from diverse sources requires combining methods for information extraction, data transformation, ontology mapping, entity matching, and data fusion. Numerous methods and tools exist for each of these tasks, but support for combining them into reproducible and effective end-to-end pipelines is still lacking. We present a new framework, KGpipe, for defining and executing integration pipelines that can combine existing tools or LLM (Large Language Model) functionality. To evaluate different pipelines and the resulting KGs, we propose a benchmark to integrate heterogeneous data of different formats (RDF, JSON, text) into a seed KG. We demonstrate the flexibility of KGpipe by running and comparatively evaluating several pipelines integrating sources of the same or different formats using selected performance and quality metrics.

Paper Structure

This paper contains 30 sections, 6 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: RDF single source pipeline layouts. TC=Type Completion.
  • Figure 2: JSON single source pipeline layouts. TC=Type Completion.
  • Figure 3: Text single source pipeline layouts. TC=Type Completion.
  • Figure 4: Ontology/Schema graph of classes film, person, company with their properties (relations and attributes).
  • Figure 5: Visualization of statistical metrics (growth) for the 12 pipelines and their three increments/stages $KG_1-KG_3$. The black dotted lines indicate the expected reference KG sizes. All three SSP (c) pipelines are omitted here.
  • ...and 1 more figures