Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness
Luca Giordano, Simon Razniewski
TL;DR
This work systematically investigates the termination, reproducibility, and robustness of LLM knowledge materialization. It introduces miniGPTKBs, domain-specific subcrawls for three domains (Babylon for history, The Big Bang Theory for entertainment, DAX 40 for finance), and uses a two-phase GPTKB workflow to extract (s, p, o) triples from seeds. Runs are evaluated on termination, yield, lexical similarity, and semantic similarity under variations of seed, language, randomness, and model. Termination rates are high in several settings, although language and model variations can hinder termination. Reproducibility is mixed: yields are stable and semantic consistency is strong, but lexical alignment is modest. Robustness is high under seed and temperature perturbations and weaker under language and model changes; ensembling significantly improves stability. The results indicate that GPTKB-style knowledge materialization can reliably surface core domain knowledge, while highlighting limitations in multilingual and model-variant settings and offering practical guidance for building stable, reusable knowledge graphs from LLMs. The work contributes empirical evidence and methodological tools (miniGPTKBs, evaluation metrics, and ensembling strategies) toward more interpretable and reproducible LLM knowledge extraction, with potential impact on QA, knowledge integration, and domain-specific AI systems.
Abstract
Large Language Models (LLMs) encode substantial factual knowledge, yet measuring and systematizing this knowledge remains challenging. Converting it into a structured format, for example through recursive extraction approaches such as the GPTKB methodology (Hu et al., 2025b), is still underexplored. Key open questions include whether such extraction can terminate, whether its outputs are reproducible, and how robust they are to variations. We systematically study LLM knowledge materialization using miniGPTKBs (domain-specific, tractable subcrawls), analyzing termination, reproducibility, and robustness across three categories of metrics: yield, lexical similarity, and semantic similarity. We experiment with four variations (seed, language, randomness, model) and three illustrative domains (from history, entertainment, and finance). Our findings show (i) high termination rates, though model-dependent; (ii) mixed reproducibility; and (iii) robustness that varies by perturbation type: high for seeds and temperature, lower for languages and models. These results suggest that LLM knowledge materialization can reliably surface core knowledge, while also revealing important limitations.
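To make the notions of recursive extraction, termination, and yield concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical `elicit` function that stands in for an LLM call returning (s, p, o) triples for a subject, treats every object as a new entity to expand (a simplification), uses an expansion budget `max_entities` as the stopping guard, and uses Jaccard overlap on exact triples as one possible stand-in for a lexical similarity measure. The toy data is illustrative only.

```python
from collections import deque
from typing import Callable, Iterable

Triple = tuple[str, str, str]


def crawl(
    seed: str,
    elicit: Callable[[str], Iterable[Triple]],  # hypothetical stand-in for an LLM call
    max_entities: int = 1000,                   # assumed expansion budget
) -> tuple[set[Triple], bool]:
    """Breadth-first expansion from a seed entity.

    Returns the collected triples and a flag that is True when the
    frontier emptied (the crawl terminated on its own) and False when
    the budget was exhausted first.
    """
    seen = {seed}
    frontier = deque([seed])
    triples: set[Triple] = set()
    expanded = 0
    while frontier and expanded < max_entities:
        subject = frontier.popleft()
        expanded += 1
        for s, p, o in elicit(subject):
            triples.add((s, p, o))
            # Simplification: every unseen object becomes a new subject.
            if o not in seen:
                seen.add(o)
                frontier.append(o)
    return triples, not frontier


def jaccard(a: set[Triple], b: set[Triple]) -> float:
    """Exact-match overlap between two triple sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy responses in place of a real model (illustrative data only).
TOY = {
    "Babylon": [
        ("Babylon", "ruled_by", "Hammurabi"),
        ("Babylon", "located_in", "Mesopotamia"),
    ],
    "Hammurabi": [("Hammurabi", "ruled", "Babylon")],
}


def toy_elicit(subject: str) -> list[Triple]:
    return TOY.get(subject, [])


run_a, done_a = crawl("Babylon", toy_elicit)                   # runs to completion
run_b, done_b = crawl("Babylon", toy_elicit, max_entities=1)   # cut off by the budget
print(len(run_a), done_a, len(run_b), done_b, jaccard(run_a, run_b))
```

Here the yield of a run is simply `len(triples)`. Comparing two runs with `jaccard` penalizes any difference in surface form, which is why a semantic similarity measure is studied alongside a lexical one.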
