Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness

Luca Giordano, Simon Razniewski

TL;DR

This work systematically investigates the termination, reproducibility, and robustness of LLM knowledge materialization using domain-specific miniGPTKBs. It introduces three miniGPTKBs (Babylon for history, The Big Bang Theory for entertainment, and DAX 40 for finance) and uses a two-phase GPTKB workflow to extract (s,p,o) triples from seeds, evaluating termination, yield, lexical similarity, and semantic similarity across variations in seed, language, randomness, and model. Termination rates are high in several settings, although language and model variations can hinder termination. Reproducibility is mixed: yields are stable and semantic consistency is strong, but lexical alignment is modest. Robustness is high under seed and temperature perturbations and weaker under language and model changes, and ensembling significantly improves stability. The results indicate that GPTKB-style knowledge materialization can reliably surface core domain-specific knowledge, while exposing limitations in multilingual and model-variant settings and offering practical guidance for building stable, reusable knowledge graphs from LLMs. The work contributes empirical evidence and methodological tools (miniGPTKBs, evaluation metrics, and ensembling strategies) toward more interpretable and reproducible LLM knowledge extraction, with potential impact on QA, knowledge integration, and domain-specific AI systems.

Abstract

Large Language Models (LLMs) encode substantial factual knowledge, yet measuring and systematizing this knowledge remains challenging. Converting it into structured format, for example through recursive extraction approaches such as the GPTKB methodology (Hu et al., 2025b), is still underexplored. Key open questions include whether such extraction can terminate, whether its outputs are reproducible, and how robust they are to variations. We systematically study LLM knowledge materialization using miniGPTKBs (domain-specific, tractable subcrawls), analyzing termination, reproducibility, and robustness across three categories of metrics: yield, lexical similarity, and semantic similarity. We experiment with four variations (seed, language, randomness, model) and three illustrative domains (from history, entertainment, and finance). Our findings show (i) high termination rates, though model-dependent; (ii) mixed reproducibility; and (iii) robustness that varies by perturbation type: high for seeds and temperature, lower for languages and models. These results suggest that LLM knowledge materialization can reliably surface core knowledge, while also revealing important limitations.
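To make the termination question concrete, here is a minimal sketch of a recursive, seed-driven crawl. It is not the authors' implementation: the function names (`elicit_triples`, `crawl`), the `max_subjects` budget, and the toy lookup table standing in for an LLM are all assumptions made for illustration, and the sketch does not model the two phases of the GPTKB workflow. It only shows the general shape of the idea: objects of extracted triples become new subjects, and a run terminates when the frontier empties before a budget is exhausted.

```python
from collections import deque

# Hypothetical stand-in for an LLM call. A real crawl would prompt a model
# for triples about `subject`; this toy table is invented for illustration.
TOY_KNOWLEDGE = {
    "Babylon": [("Babylon", "ruledBy", "Hammurabi"),
                ("Babylon", "locatedIn", "Mesopotamia")],
    "Hammurabi": [("Hammurabi", "authored", "Code of Hammurabi"),
                  ("Hammurabi", "ruled", "Babylon")],
    "Mesopotamia": [("Mesopotamia", "contains", "Babylon")],
}


def elicit_triples(subject):
    """Return (s, p, o) triples for a subject (toy lookup, not a real model)."""
    return TOY_KNOWLEDGE.get(subject, [])


def crawl(seed, max_subjects=1000):
    """Breadth-first crawl from a seed; returns (triples, terminated).

    `terminated` is True only if the frontier emptied before the
    (assumed) budget of `max_subjects` expansions was used up.
    """
    queue = deque([seed])
    seen = {seed}
    triples = set()
    expanded = 0
    while queue:
        if expanded >= max_subjects:
            return triples, False  # budget hit with subjects still pending
        subject = queue.popleft()
        expanded += 1
        for s, p, o in elicit_triples(subject):
            triples.add((s, p, o))
            if o not in seen:  # each object is queued at most once
                seen.add(o)
                queue.append(o)
    return triples, True


if __name__ == "__main__":
    triples, terminated = crawl("Babylon")
    print(len(triples), terminated)
```

In this toy setting the crawl terminates because the lookup table is finite and every entity is expanded at most once; with a real model, whether the frontier ever empties is exactly the empirical question the paper studies.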

Paper Structure

This paper contains 37 sections, 5 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Overview of our approach.
  • Figure 2: Overview of research questions and methodology.
  • Figure 3: Similarity metrics across entity popularity buckets under the different settings and topics.
  • Figure 4: Triples shared across X runs of babylonGPTKB with different k values.
  • Figure 5: Triples shared across X runs of tbbtGPTKB with different k values.
  • ...and 3 more figures
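
The captions of Figures 4 and 5 refer to triples shared across X runs. As a rough illustration of how such a count could be computed, the sketch below tallies, for each X, how many distinct triples occur in exactly X of the given runs. The function name `shared_across_runs`, the "exactly X" reading (rather than "at least X"), and the toy triples are assumptions; the sketch does not model the k parameter mentioned in the captions.

```python
from collections import Counter


def shared_across_runs(runs):
    """Map X -> number of distinct triples that occur in exactly X runs.

    `runs` is a list of collections of (s, p, o) tuples, one per crawl.
    Duplicates inside a single run are counted once.
    """
    occurrences = Counter(t for run in runs for t in set(run))
    histogram = Counter(occurrences.values())
    return dict(sorted(histogram.items()))


if __name__ == "__main__":
    # Invented toy runs, for illustration only.
    a = ("Babylon", "ruledBy", "Hammurabi")
    b = ("Babylon", "locatedIn", "Mesopotamia")
    c = ("Hammurabi", "authored", "Code of Hammurabi")
    d = ("Babylon", "foundedBy", "unknown")
    e = ("Mesopotamia", "contains", "Babylon")
    runs = [{a, b, c}, {a, b, d}, {a, e}]
    print(shared_across_runs(runs))
```

Triples in the highest bucket are the ones every run agrees on, which is one simple way to read "core knowledge" off a set of repeated crawls.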