Do LLMs Dream of Ontologies?

Marco Bombieri, Paolo Fiorini, Simone Paolo Ponzetto, Marco Rospocher

TL;DR

This work probes whether general-purpose LLMs memorize the ID–label associations of public ontologies. It introduces a zero-shot task to retrieve ontology IDs from labels across GO, Uberon, ICD-10, and Wikidata, evaluated on Pythia-12B, Gemini-1.5F, GPT-3.5, and GPT-4, revealing that memorization is generally limited and highly dependent on concept popularity on the Web. A strong correlation between Web exposure and memorization accuracy is reported, suggesting that training data content drives recall more than structured ontology ingestion. To gauge memorization robustness, the authors propose prediction-invariance metrics under prompt perturbations and language variations, finding that invariance correlates with accuracy and can serve as a proxy for memorization. Collectively, the findings highlight the mixed ability of LLMs to recall ontological facts and point to invariance-based methods as practical tools for assessing structured knowledge memorization and guiding mitigation of hallucinations.

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance across diverse natural language processing tasks, yet their ability to memorize structured knowledge remains underexplored. In this paper, we investigate the extent to which general-purpose pre-trained LLMs retain and correctly reproduce concept identifier (ID)-label associations from publicly available ontologies. We conduct a systematic evaluation across multiple ontological resources, including the Gene Ontology, Uberon, Wikidata, and ICD-10, using LLMs such as Pythia-12B, Gemini-1.5-Flash, GPT-3.5, and GPT-4. Our findings reveal that only a small fraction of ontological concepts is accurately memorized, with GPT-4 demonstrating the highest performance. To understand why certain concepts are memorized more effectively than others, we analyze the relationship between memorization accuracy and concept popularity on the Web. Our results indicate a strong correlation between the frequency of a concept's occurrence online and the likelihood of accurately retrieving its ID from the label. This suggests that LLMs primarily acquire such knowledge through indirect textual exposure rather than directly from structured ontological resources. Furthermore, we introduce new metrics to quantify prediction invariance, demonstrating that the stability of model responses across variations in prompt language and temperature settings can serve as a proxy for estimating memorization robustness.

Paper Structure (33 sections, 3 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Distribution of the number of Web occurrences (log-scaled, base $e$) for each dataset (Gene Ontology, Uberon, ICD-10, and Wikidata).
  • Figure 2: Average accuracy of the model's ID prediction (y-axis) according to the popularity of the concept on the Web (represented by the bucket number in x-axis) in the Gene Ontology (a) and ICD-10 (b).
  • Figure 3: Variation over all the buckets of the $R_{B_{i}}$ values, capturing the ratio between the number of top-500 repeated IDs that belong to bucket $B_{i}$ and the proportion of them that should be in bucket $B_{i}$ according to the overall distribution of all IDs in the buckets, for the Gene Ontology (a) and ICD-10 (b).
  • Figure 4: Variation of AvPI (left) and accuracy (right) on the different buckets of the Gene Ontology when applying the PI-1, PI-2, and PI-3 invariance strategies to GPT-3.5, Gemini-1.5F, and Pythia-12B.
  • Figure 5: Variation of AvPI (left) and accuracy (right) on the different buckets of ICD-10 when applying the PI-1, PI-2, and PI-3 invariance strategies to GPT-3.5, Gemini-1.5F, and Pythia-12B.
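
The bucket ratio $R_{B_{i}}$ described in the Figure 3 caption can be illustrated with a minimal sketch. This is one plausible reading of the caption, not the authors' implementation: it assumes $R_{B_{i}}$ is the observed number of top repeated IDs in bucket $B_{i}$ divided by the number expected if those IDs followed the overall bucket distribution. The function name `bucket_ratios`, the `id_to_bucket` mapping, and the toy data are all hypothetical.

```python
from collections import Counter


def bucket_ratios(top_ids, id_to_bucket):
    """Return {bucket: observed / expected} for the IDs in top_ids.

    Assumed reading of the caption (not confirmed by the source):
    expected = len(top_ids) * |B_i| / (total number of IDs).
    """
    bucket_sizes = Counter(id_to_bucket.values())           # |B_i| over all IDs
    observed = Counter(id_to_bucket[i] for i in top_ids)    # top IDs per bucket
    n_all = len(id_to_bucket)
    n_top = len(top_ids)
    ratios = {}
    for bucket, size in bucket_sizes.items():
        expected = n_top * size / n_all
        ratios[bucket] = observed.get(bucket, 0) / expected
    return ratios


# Toy data (hypothetical): three IDs in bucket 1, one ID in bucket 2.
id_to_bucket = {"A": 1, "B": 1, "C": 1, "D": 2}
top_ids = ["A", "D"]  # stand-in for the top-500 most repeated IDs
ratios = bucket_ratios(top_ids, id_to_bucket)
```

Under this reading, a ratio above 1 means the bucket is over-represented among the most repeated predicted IDs, and a ratio below 1 means it is under-represented.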