Schema Generation for Large Knowledge Graphs Using Large Language Models

Bohui Zhang, Yuan He, Lydia Pintscher, Albert Meroño Peñuela, Elena Simperl

TL;DR

This work introduces the first benchmark for automatically generating Shape Expressions schemas from large knowledge graphs using large language models. It presents two KG-derived datasets, YAGO Schema (YAGOS) and Wikidata EntitySchema (WES), and a dual-metric evaluation framework combining structure-based similarity (GED/NGED) with constraint-level classification (exact and relaxed matches). The authors develop LLM-driven pipelines with local, global, and triples information settings, plus a structured generation approach that converts JSON constraints into ShEx, and demonstrate competitive performance against strong baselines. The results indicate LLMs have strong potential for scalable, automated KG schema generation, while also highlighting areas for improvement in cardinality reasoning and strict predicate-class reasoning. Overall, the paper provides a foundational benchmark and methodology for future research in automated KG schema generation and structured generation with LLMs.

Abstract

Schemas play a vital role in ensuring data quality and supporting usability in the Semantic Web and natural language processing. Traditionally, their creation demands substantial involvement from knowledge engineers and domain experts. Leveraging the impressive capabilities of large language models (LLMs) in tasks like ontology engineering, we explore schema generation using LLMs. To bridge the resource gap, we introduce two datasets: YAGO Schema and Wikidata EntitySchema, along with novel evaluation metrics. The LLM-based pipelines utilize local and global information from knowledge graphs (KGs) to generate schemas in Shape Expressions (ShEx). Experiments demonstrate LLMs' strong potential in producing high-quality ShEx schemas, paving the way for scalable, automated schema generation for large KGs. Furthermore, our benchmark introduces a new challenge for structured generation, pushing the limits of LLMs on syntactically rich formalisms.

Paper Structure

This paper contains 27 sections, 12 equations, 3 figures, 12 tables.

Figures (3)

  • Figure 1: Experimental setup. In the local and triples settings, LLMs generate ShEx schema scripts end-to-end. In the global setting, LLMs first generate constraints in JSON format using their structured generation ability, which are then formulated into ShEx schema scripts.
  • Figure 2: Error distribution across models and settings on WES (left) and YAGOS (right) datasets. The figure shows five categories: four error types and correctly generated constraints.
  • Figure 3: Example ShEx schema fragment for 'Museum (Q33506)' from the WES dataset: (a) the ShExC textual representation, where comments above each constraint indicate the label of its predicate, and (b) its corresponding tree structure representation used for similarity metrics.
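
The global setting described in Figure 1 has the LLM emit constraints as JSON, which are then turned into a ShEx script. A minimal sketch of that second step follows. It is not the authors' converter: the JSON field names (predicate, values, node_kind, min, max), the helper names, and the choice of wdt:P17 as a sample predicate are all assumptions made for illustration. Only the class 'Museum (Q33506)' comes from the Figure 3 caption.

```python
import json


def cardinality_suffix(min_count, max_count):
    """Map (min, max) to a ShExC cardinality marker; max=None means unbounded."""
    if (min_count, max_count) == (1, 1):
        return ""  # exactly one is the ShExC default, so nothing is written
    if (min_count, max_count) == (0, 1):
        return " ?"
    if (min_count, max_count) == (0, None):
        return " *"
    if (min_count, max_count) == (1, None):
        return " +"
    upper = "*" if max_count is None else str(max_count)
    return f" {{{min_count},{upper}}}"


def constraints_to_shexc(shape_label, constraints):
    """Render a list of constraint dicts (hypothetical format) as one ShExC shape."""
    lines = []
    for c in constraints:
        if c.get("values"):
            # A value set, e.g. [ wd:Q33506 ]
            value_expr = "[ " + " ".join(c["values"]) + " ]"
        else:
            # A node kind such as IRI or LITERAL; "." accepts any value
            value_expr = c.get("node_kind", ".")
        suffix = cardinality_suffix(c.get("min", 1), c.get("max", 1))
        lines.append(f"  {c['predicate']} {value_expr}{suffix}")
    body = " ;\n".join(lines)
    return f"<{shape_label}> {{\n{body}\n}}"


# Hypothetical JSON output from the LLM for the Museum class
raw = """[
  {"predicate": "wdt:P31", "values": ["wd:Q33506"], "min": 1, "max": 1},
  {"predicate": "wdt:P17", "node_kind": "IRI", "min": 0, "max": null}
]"""

schema = constraints_to_shexc("Museum", json.loads(raw))
print(schema)
```

Keeping the LLM output in JSON means that syntax errors in the final ShEx script can only come from the converter, which is deterministic and easy to test. A full script would also need PREFIX declarations for wd: and wdt:, which this sketch leaves out.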

Theorems & Definitions (3)

  • Definition 1: Knowledge Graph
  • Definition 2: Shape
  • Definition 3: Schema Generation