Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs
Federico Ranaldi, Andrea Zugarini, Leonardo Ranaldi, Fabio Massimo Zanzotto
TL;DR
This work formalizes KG protoknowledge as the memorized, reusable knowledge within LLMs, decomposing it into lexical, hierarchical, and topological forms. It introduces Knowledge Activation Tasks (KATs) to probe each form and links protoknowledge absorption to downstream Text-to-SPARQL performance under different prompting scenarios. Across experiments with GPT-4, GPT-3.5-Turbo, and Llama variants on Wikidata and DBpedia, topological protoknowledge emerges as the strongest predictor of correct SPARQL generation, while lexical and hierarchical forms also contribute, especially under low-context prompts. The study highlights a persistent semantic bias shaped by pretraining data distribution, underscores risks of semantic-level data contamination, and provides a practical framework for evaluating and leveraging protoknowledge in Closed-Pretraining settings.
Abstract
We introduce the concept of protoknowledge to formalize and measure how sequences of tokens encoding Knowledge Graphs are internalized during pretraining and utilized at inference time by Large Language Models (LLMs). LLMs have demonstrated the ability to memorize vast amounts of token sequences during pretraining, and a central open question is how they leverage this memorization as reusable knowledge through generalization. We categorize protoknowledge into lexical, hierarchical, and topological forms, which differ in the type of knowledge that needs to be activated. We measure protoknowledge through Knowledge Activation Tasks (KATs) and analyze its general properties, such as semantic bias. We then investigate the impact of protoknowledge on Text-to-SPARQL performance by varying prompting strategies depending on input conditions. To this end, we adopt a novel analysis framework that assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query. This methodology provides a practical tool to explore Semantic-Level Data Contamination and serves as an effective strategy for Closed-Pretraining models.
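
As an illustration of the per-query alignment idea described above, the following is a minimal sketch, not the authors' implementation. It assumes each query has already been reduced to two booleans: whether the relevant protoknowledge was activated (for example, the model passed the corresponding KAT) and whether the generated SPARQL was judged correct. The names `QueryOutcome` and `alignment_summary` and the three reported statistics are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryOutcome:
    # Hypothetical field: did the model activate the protoknowledge
    # relevant to this query (e.g. pass the matching KAT)?
    activated: bool
    # Hypothetical field: was the generated SPARQL query judged correct?
    correct: bool


def _accuracy(group: list[QueryOutcome]) -> Optional[float]:
    """Fraction of correct predictions in a group, or None if it is empty."""
    if not group:
        return None
    return sum(o.correct for o in group) / len(group)


def alignment_summary(outcomes: list[QueryOutcome]) -> dict[str, Optional[float]]:
    """Summarize how often correctness coincides with activation."""
    if not outcomes:
        raise ValueError("at least one query outcome is required")
    # Agreement: activation and correctness take the same truth value.
    agreement = sum(o.activated == o.correct for o in outcomes) / len(outcomes)
    activated = [o for o in outcomes if o.activated]
    not_activated = [o for o in outcomes if not o.activated]
    return {
        "agreement": agreement,
        "accuracy_if_activated": _accuracy(activated),
        "accuracy_if_not_activated": _accuracy(not_activated),
    }
```

A large gap between `accuracy_if_activated` and `accuracy_if_not_activated` would be consistent with protoknowledge shaping downstream performance; the paper's own framework may define and aggregate these quantities differently.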
