KIF: A Wikidata-Based Framework for Integrating Heterogeneous Knowledge Sources

Guilherme Lima, João M. B. Rodrigues, Marcelo Machado, Elton Soares, Sandro R. Fiorini, Raphael Thiago, Leonardo G. Azevedo, Viviane T. da Silva, Renato Cerqueira

TL;DR

KIF is a Wikidata-based framework for virtually integrating heterogeneous knowledge sources. A store abstraction exposes a Wikidata-like view of each data source, a pattern-driven query interface retrieves statements from it, and a mixer store federates multiple sources while preserving provenance through annotations. The framework includes a SPARQL store, a PubChem mapping, and additional store types (RDF, CSV), enabling integration with sources such as Wikidata, PubChem, and IBM CIRCA via Ontop. An application in chemistry demonstrates the approach, and the evaluation indicates that KIF's overhead is negligible compared to the time spent in endpoint processing. The work argues for the practicality of Wikidata as a universal integration model and outlines future improvements such as parallel querying, mutable stores, and formal semantics.

Abstract

We present a Wikidata-based framework, called KIF, for virtually integrating heterogeneous knowledge sources. KIF is written in Python and is released as open source. It leverages Wikidata's data model and vocabulary plus user-defined mappings to construct a unified view of the underlying sources while keeping track of the context and provenance of their statements. The underlying sources can be triplestores, relational databases, CSV files, etc., which may or may not use the vocabulary and RDF encoding of Wikidata. The end result is a virtual knowledge base which behaves like an "extended Wikidata" and which can be queried using a simple but expressive pattern language, defined in terms of Wikidata's data model. In this paper, we present the design and implementation of KIF, discuss how we have used it to solve a real integration problem in the domain of chemistry (involving Wikidata, PubChem, and IBM CIRCA), and present experimental results on the performance and overhead of KIF.
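
The store abstraction, pattern matching, and mixer store described above can be illustrated with a small, self-contained Python sketch. It does not use KIF's actual API: the class names (`Statement`, `Pattern`, `ListStore`, `MixerStore`), the `match_annotated` helper, and the sample values are all hypothetical and chosen only for illustration. Only the identifiers Q2270 (benzene) and P2240 (LD50) come from the figure captions of the paper; the store contents are toy data, not the actual contents of Wikidata or PubChem.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Protocol, Tuple


@dataclass(frozen=True)
class Statement:
    """Simplified Wikidata-style statement: (subject, property, value)."""
    subject: str
    prop: str
    value: str


@dataclass(frozen=True)
class Pattern:
    """Statement pattern; a field left as None acts as a wildcard."""
    subject: Optional[str] = None
    prop: Optional[str] = None
    value: Optional[str] = None

    def matches(self, stmt: Statement) -> bool:
        pairs = (
            (self.subject, stmt.subject),
            (self.prop, stmt.prop),
            (self.value, stmt.value),
        )
        return all(want is None or want == got for want, got in pairs)


class Store(Protocol):
    """Anything that can answer pattern queries with statements."""
    name: str

    def match(self, pattern: Pattern) -> Iterator[Statement]: ...


class ListStore:
    """Hypothetical in-memory store standing in for a real source."""

    def __init__(self, name: str, statements: Iterable[Statement]) -> None:
        self.name = name
        self._statements = list(statements)

    def match(self, pattern: Pattern) -> Iterator[Statement]:
        return (s for s in self._statements if pattern.matches(s))


class MixerStore:
    """Federates child stores and records which store produced each result."""

    def __init__(self, name: str, children: Iterable[Store]) -> None:
        self.name = name
        self._children = list(children)

    def match_annotated(self, pattern: Pattern) -> Iterator[Tuple[Statement, str]]:
        # Query children one after another, tagging each hit with its origin.
        for child in self._children:
            for stmt in child.match(pattern):
                yield stmt, child.name

    def match(self, pattern: Pattern) -> Iterator[Statement]:
        # Same results, with duplicates across children collapsed.
        seen = set()
        for stmt, _ in self.match_annotated(pattern):
            if stmt not in seen:
                seen.add(stmt)
                yield stmt


# Toy data only: two stores that happen to agree on one statement.
ld50 = Statement("Q2270", "P2240", "4700 mg/kg")
wikidata = ListStore("wikidata", [ld50])
pubchem = ListStore("pubchem", [Statement("Q2270", "P274", "C6H6"), ld50])
mixer = MixerStore("mixer", [wikidata, pubchem])

query = Pattern(subject="Q2270", prop="P2240")
annotated = list(mixer.match_annotated(query))
merged = list(mixer.match(query))
```

Here `annotated` keeps one entry per source, so the origin of each statement stays visible, while `merged` presents the unified view with the shared statement listed once. Because a mixer exposes the same `match` method as its children, it can itself be nested inside another mixer.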

Paper Structure (17 sections, 2 equations, 6 figures)

Figures (6)

  • Figure 1: Part of Wikidata's entity page of benzene. (Adapted from Odell-J-2022.)
  • Figure 2: Constructors of data model objects in KIF. "$?$" means zero-or-one; "$+$" means one-or-more; $s$ is a Python string; $\text{lang}$ is a Python string containing a language tag such as "en"; $n$ is a number; $i$ is an integer; and $ts$ is a date-time timestamp.
  • Figure 3: RDF representation of the statement "Benzene (Q2270) has an LD50 (P2240) of 4,700 $\pm$1 mg/kg (Q21091747)" considering only the qualifier "route of administration (P636) is oral administration (Q285166)" and the reference record "reference URL (P854) is https://www.cdc.gov/niosh-rtecs/CY155CC0.html".
  • Figure 4: Evaluation of match($p$) over a SPARQL store.
  • Figure 5: KIF instantiation integrating PubChem, Wikidata, and IBM CIRCA.
  • ...and 1 more figure