PyGraft: Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips

Nicolas Hubert, Pierre Monnin, Mathieu d'Aquin, Davy Monticolo, Armelle Brun

TL;DR

The paper tackles the scarcity and bias of public KGs and the need for domain-agnostic benchmarks by introducing PyGraft, a Python tool that can generate both customizable schemas and knowledge graphs in a single pipeline and verify their logical consistency using a description logic reasoner. The method combines a schema generator (class and relation builders) with a KG generator that assigns entities to classes and creates triples, all while enforcing OWL/RDFS constraints and consistency checks. Its key contributions include domain-agnostic schema and KG generation, an integrated pipeline with DL reasoning, and demonstrated scalability across configurations (e.g., up to 100K entities and 1M triples with first-pass consistency). The tool enables diverse benchmarking for graph-based learning and KG processing, supports privacy-preserving data generation in sensitive domains, and fosters schema-driven, neuro-symbolic approaches, with an open-source release to encourage community contributions.
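The three stages named above (schema generation, KG generation, consistency checking) can be illustrated with a toy sketch. This is not PyGraft's API or implementation: every function name, parameter, and data structure below is hypothetical, and the consistency check is a simple class-disjointness test standing in for the description logic reasoner that the tool actually runs.

```python
import random

def generate_schema(num_classes, num_relations, rng):
    """Hypothetical schema builder: classes, one disjointness axiom, typed relations."""
    classes = [f"C{i}" for i in range(1, num_classes + 1)]
    # One pair of classes declared disjoint (stand-in for owl:disjointWith).
    disjoint = {frozenset(rng.sample(classes, 2))}
    # Each relation gets a domain and a range (stand-in for rdfs:domain / rdfs:range).
    relations = {
        f"R{j}": {"domain": rng.choice(classes), "range": rng.choice(classes)}
        for j in range(1, num_relations + 1)
    }
    return {"classes": classes, "disjoint": disjoint, "relations": relations}

def generate_kg(schema, num_entities, num_triples, rng):
    """Hypothetical KG builder: type each entity, then sample domain/range-compliant triples."""
    entity_type = {f"E{i}": rng.choice(schema["classes"])
                   for i in range(1, num_entities + 1)}
    by_class = {}
    for entity, cls in entity_type.items():
        by_class.setdefault(cls, []).append(entity)
    # Only relations whose domain and range both have instances can be used.
    usable = [(name, sig) for name, sig in schema["relations"].items()
              if sig["domain"] in by_class and sig["range"] in by_class]
    triples, attempts = set(), 0
    while usable and len(triples) < num_triples and attempts < 10 * num_triples:
        attempts += 1
        name, sig = rng.choice(usable)
        head = rng.choice(by_class[sig["domain"]])
        tail = rng.choice(by_class[sig["range"]])
        triples.add((head, name, tail))
    return entity_type, triples

def is_consistent(schema, entity_type, triples):
    """Simplified check: no entity may end up typed with two disjoint classes."""
    types = {entity: {cls} for entity, cls in entity_type.items()}
    for head, rel, tail in triples:
        # Domain and range axioms let us infer extra types for head and tail.
        types[head].add(schema["relations"][rel]["domain"])
        types[tail].add(schema["relations"][rel]["range"])
    return not any(pair <= inferred
                   for inferred in types.values()
                   for pair in schema["disjoint"])

rng = random.Random(0)
schema = generate_schema(num_classes=4, num_relations=3, rng=rng)
entity_type, triples = generate_kg(schema, num_entities=20, num_triples=30, rng=rng)
print(is_consistent(schema, entity_type, triples))
```

Because this sketch samples heads and tails only from the declared domain and range, the check passes by construction. A real reasoner is needed once the schema combines richer constructs, which is the case the paper addresses.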

Abstract

Knowledge graphs (KGs) have emerged as a prominent data representation and management paradigm. Being usually underpinned by a schema (e.g., an ontology), KGs capture not only factual information but also contextual knowledge. In some tasks, a few KGs established themselves as standard benchmarks. However, recent works outline that relying on a limited collection of datasets is not sufficient to assess the generalization capability of an approach. In some data-sensitive fields such as education or medicine, access to public datasets is even more limited. To remedy the aforementioned issues, we release PyGraft, a Python-based tool that generates highly customized, domain-agnostic schemas and KGs. The synthesized schemas encompass various RDFS and OWL constructs, while the synthesized KGs emulate the characteristics and scale of real-world KGs. Logical consistency of the generated resources is ultimately ensured by running a description logic (DL) reasoner. By providing a way of generating both a schema and KG in a single pipeline, PyGraft's aim is to empower the generation of a more diverse array of KGs for benchmarking novel approaches in areas such as graph-based machine learning (ML), or more generally KG processing. In graph-based ML in particular, this should foster a more holistic evaluation of model performance and generalization capability, thereby going beyond the limited collection of available benchmarks. PyGraft is available at: https://github.com/nicolas-hbt/pygraft.

Paper Structure

This paper contains 14 sections, 4 figures, 5 tables, 3 algorithms.

Figures (4)

  • Figure 1: PyGraft general overview.
  • Figure 2: Potential class hierarchies for the constraints: $\texttt{num\_classes} = 6$, $\texttt{max\_depth} = 3$, $\texttt{avg\_depth} = 1.5$, and $\texttt{inheritance\_ratio} = 2.5$. Left and middle class hierarchies are built with parameter priority. The right class hierarchy is built with a best-effort strategy, without specific parameter privilege.
  • Figure 3: Execution time breakdown for each configuration.
  • Figure : Class Generation
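To make the class-hierarchy constraints in the Figure 2 caption concrete, the sketch below measures one hand-built hierarchy against them. The definitions used are assumptions, not taken from the paper: depth counts subclass edges from a root `owl:Thing` at depth 0, `avg_depth` is the mean depth over all classes, and `inheritance_ratio` is the number of classes divided by the number of classes (root included) that have at least one direct subclass. The hierarchy itself and the names `parent_of` and `hierarchy_stats` are hypothetical.

```python
from statistics import mean

ROOT = "owl:Thing"  # assumed root, placed at depth 0 and not counted as a class

def depth(cls, parent_of):
    """Number of subclass edges between cls and the root."""
    d = 0
    while cls != ROOT:
        cls = parent_of[cls]
        d += 1
    return d

def hierarchy_stats(parent_of):
    """Compute the four quantities under the assumed definitions above."""
    depths = [depth(cls, parent_of) for cls in parent_of]
    # Every class that appears as a parent has at least one direct subclass.
    non_leaves = set(parent_of.values())
    return {
        "num_classes": len(parent_of),
        "max_depth": max(depths),
        "avg_depth": mean(depths),
        "inheritance_ratio": len(parent_of) / len(non_leaves),
    }

# Hypothetical hierarchy: four classes directly under the root, then a chain C1 > C5 > C6.
parent_of = {
    "C1": ROOT, "C2": ROOT, "C3": ROOT, "C4": ROOT,
    "C5": "C1",
    "C6": "C5",
}
stats = hierarchy_stats(parent_of)
print(stats)
# {'num_classes': 6, 'max_depth': 3, 'avg_depth': 1.5, 'inheritance_ratio': 2.0}
```

Under these assumed definitions, this hierarchy meets the targets for `num_classes`, `max_depth`, and `avg_depth` but reaches an inheritance ratio of 2.0 rather than 2.5. This shows how a small hierarchy may be unable to satisfy every target at once.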