Structural Hallucination in Large Language Models: A Network-Based Evaluation of Knowledge Organization and Citation Integrity

Moses Boudourides

TL;DR

Structural fidelity cannot be inferred from local fluency alone. The proposed network-based stress test provides a reproducible instrument for evaluating the structural integrity of LLM-generated knowledge representations within knowledge organization and information quality research.

Abstract

Large Language Models (LLMs) increasingly mediate access to scholarly information, yet their outputs are typically evaluated at the level of individual statements rather than knowledge structure. This paper introduces structural hallucination: systematic distortion of conceptual organization, relational architecture, and bibliographic grounding that remains invisible to sentence-level accuracy metrics. To detect such distortions, we develop a network-based hallucination stress test grounded in knowledge graph extraction, graph similarity analysis, centrality comparison, and citation integrity verification. The protocol is applied to three structured domains representing core forms of scholarly knowledge: Roget's Thesaurus (1911) as a lexical ontology, Wikidata philosophers as a biographical knowledge graph, and bibliographic citation records retrieved from the Dimensions.ai database. Across all domains, substantial structural divergence is observed. In the lexical benchmark, macro-averaged F1 scores fall below 0.05; in the biographical benchmark, hallucination rates exceed 93%; and in the bibliometric benchmark, citation omission reaches 91.9%. Network-level comparison in the Roget reconstruction further reveals node-set Jaccard similarity of 0.028 and fabrication rates above 94%. These findings show that structural fidelity cannot be inferred from local fluency alone. The proposed stress test provides a reproducible instrument for evaluating the structural integrity of LLM-generated knowledge representations within knowledge organization and information quality research.
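The set-level quantities named in the abstract (node-set Jaccard similarity, fabrication, omission) can be illustrated with a minimal Python sketch. The definitions below are assumptions made for illustration, since this summary does not give the paper's exact formulas: fabrication is taken as the share of generated items absent from the reference, and omission as the share of reference items absent from the generated output. The node sets are hypothetical toy data, not drawn from the benchmarks.

```python
def jaccard(reference, generated):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two node sets."""
    ref, gen = set(reference), set(generated)
    union = ref | gen
    # Two empty sets are treated as identical by convention.
    return len(ref & gen) / len(union) if union else 1.0


def fabrication_rate(reference, generated):
    """Assumed definition: share of generated items not in the reference."""
    ref, gen = set(reference), set(generated)
    return len(gen - ref) / len(gen) if gen else 0.0


def omission_rate(reference, generated):
    """Assumed definition: share of reference items missing from the output."""
    ref, gen = set(reference), set(generated)
    return len(ref - gen) / len(ref) if ref else 0.0


# Hypothetical toy node sets (not taken from the paper's data).
reference_nodes = {"existence", "being", "entity", "reality", "actuality"}
generated_nodes = {"existence", "presence", "life", "essence"}

node_jaccard = jaccard(reference_nodes, generated_nodes)
fabricated = fabrication_rate(reference_nodes, generated_nodes)
omitted = omission_rate(reference_nodes, generated_nodes)

print(f"Jaccard={node_jaccard:.3f} fabrication={fabricated:.2f} omission={omitted:.2f}")
```

Under these assumed definitions, a low Jaccard score together with high fabrication and omission rates means the generated graph shares almost no nodes with the reference, whatever its overall shape.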

Paper Structure (73 sections, 1 equation, 13 figures, 8 tables)


Figures (13)

  • Figure 1: Classical evaluation of LLM reconstruction of Roget's Thesaurus ($n = 30$ Heads). The four panels show (top left) TP/FP/FN counts per field, (top right) precision, recall, F1, and accuracy per field, (bottom left) metric heatmap across all fields, and (bottom right) F1 score per field with reference thresholds at 0.4 and 0.7. The dominant category in the TP/FP/FN panel is false negatives, reflecting the model's near-total inability to recall the 1911 vocabulary. All F1 scores are below 0.05; adverbs score exactly zero.
  • Figure 2: Semantic similarity scores (cosine similarity using all-MiniLM-L6-v2) between ground-truth Roget term lists and LLM-generated lists, per field. Error bars show one standard deviation across the 30 sampled Heads. Nouns and adjectives achieve moderate similarity (0.43–0.47), indicating conceptual alignment despite lexical divergence. Adverbs and cross-references score below 0.30, reflecting the model's tendency to hallucinate content for these fields.
  • Figure 3: ROC curve for the hallucination classifier trained on the Roget benchmark. The near-perfect AUC confirms that hallucination is a systematic, structurally predictable failure mode rather than random error. The classifier uses token overlap, edit distance, and semantic similarity as features to distinguish hallucinated from genuine term matches.
  • Figure 4: Directed knowledge graphs for the Roget (left) and LLM-generated (right) datasets. Each cluster represents one of the 30 sampled Heads. The visual similarity of the two graphs conceals a near-total divergence in node content: the LLM has reproduced the structural form of the ontology while replacing virtually all of its substance with fabricated terms. This is the defining visual signature of structural hallucination.
  • Figure 5: PageRank comparison between the top nodes of the Roget graph (blue) and LLM-generated graph (orange), with gold bars indicating LLM-only fabricated nodes that have been assigned high PageRank by the model. In the Roget graph, PageRank is concentrated on Head nodes, which are the canonical conceptual anchors of the ontology. In the LLM-generated graph, fabricated term nodes are elevated to positions of structural influence, displacing the canonical hierarchy.
  • ...and 8 more figures
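
The per-field TP/FP/FN counts and the precision, recall, and F1 scores described in the Figure 1 caption can be sketched as follows. This is an illustrative sketch that assumes exact string matching between term sets and an unweighted mean over fields for the macro average; the paper's actual matching rule is not stated here. The field names and terms are hypothetical toy data.

```python
def field_scores(reference, generated):
    """Set-based TP/FP/FN counts and precision, recall, F1 for one field."""
    ref, gen = set(reference), set(generated)
    tp = len(ref & gen)   # terms in both lists
    fp = len(gen - ref)   # generated terms absent from the reference
    fn = len(ref - gen)   # reference terms the model failed to produce
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


def macro_f1(scores_by_field):
    """Unweighted mean of per-field F1 (assumed macro-averaging scheme)."""
    values = [s["f1"] for s in scores_by_field.values()]
    return sum(values) / len(values) if values else 0.0


# Hypothetical toy data: field -> (reference terms, generated terms).
toy = {
    "nouns": ({"being", "entity", "reality", "fact"}, {"being", "presence"}),
    "adverbs": ({"actually"}, {"really"}),
}

scores = {field: field_scores(ref, gen) for field, (ref, gen) in toy.items()}
overall = macro_f1(scores)

for field, s in scores.items():
    print(field, s)
print("macro F1:", round(overall, 4))
```

With exact matching, a field whose generated terms never coincide with the reference scores an F1 of exactly zero, as the caption reports for adverbs.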