Unveiling LLMs: The Evolution of Latent Representations in a Dynamic Knowledge Graph
Marco Bronzini, Carlo Nicolini, Bruno Lepri, Jacopo Staiano, Andrea Passerini
TL;DR
The paper examines how LLMs internally encode factual knowledge for claim verification. It introduces an end-to-end framework that uses activation patching to decode this latent information into ground predicates, then represents the results as a dynamic knowledge graph that evolves across model layers. Applying this approach to FEVER and CLIMATE-FEVER with a 7B-parameter LLaMA 2 model, the authors demonstrate both local interpretability (entity centrality and multi-hop reasoning) and global interpretability (layer-wise evolution patterns and transition points). The key contributions are: (i) an activation-patching pipeline that converts token representations into structured facts without training, (ii) a graph-based representation that captures how knowledge evolves from layer to layer, and (iii) analyses revealing how factual information shifts from word-level to claim-level facts and how representation errors can lead to incorrect evaluations. These insights advance mechanistic interpretability and offer a framework for diagnosing the factual knowledge resolution process in LLMs, with practical implications for bias and reliability.
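To make the idea of a layer-wise dynamic knowledge graph concrete, the following is a minimal sketch, assuming the decoded facts for each layer are available as sets of (subject, predicate, object) triples. The triples, layer numbers, and the helper names `layer_changes` and `degree_centrality` are hypothetical illustrations, not data or code from the paper.

```python
from collections import Counter

# Hypothetical per-layer facts as (subject, predicate, object) triples.
# These are invented for illustration and are not results from the paper.
facts_by_layer = {
    1: {("claim", "mentions", "Paris")},
    2: {("claim", "mentions", "Paris"), ("Paris", "located_in", "France")},
    3: {("Paris", "located_in", "France"), ("Paris", "capital_of", "France")},
}


def layer_changes(facts_by_layer):
    """Return {layer: (added, removed)} relative to the previous layer."""
    changes = {}
    previous = set()
    for layer in sorted(facts_by_layer):
        current = facts_by_layer[layer]
        changes[layer] = (current - previous, previous - current)
        previous = current
    return changes


def degree_centrality(facts):
    """Count how many facts each entity takes part in, as subject or object."""
    degree = Counter()
    for subj, _pred, obj in facts:
        degree[subj] += 1
        degree[obj] += 1
    return degree


changes = layer_changes(facts_by_layer)
final_degree = degree_centrality(facts_by_layer[3])
```

Here `changes` records which facts appear or disappear at each layer, and `final_degree` is a simple stand-in for the entity-centrality measures that a graph library would provide.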
Abstract
Large Language Models (LLMs) demonstrate an impressive capacity to recall a vast range of factual knowledge. However, understanding their underlying reasoning and internal mechanisms in exploiting this knowledge remains a key research area. This work unveils the factual information an LLM represents internally for sentence-level claim verification. We propose an end-to-end framework to decode factual knowledge embedded in token representations from a vector space to a set of ground predicates, showing its layer-wise evolution using a dynamic knowledge graph. Our framework employs activation patching, a vector-level technique that alters a token representation during inference, to extract encoded knowledge. Accordingly, we rely on neither training nor external models. Using factual and common-sense claims from two claim verification datasets, we showcase interpretability analyses at local and global levels. The local analysis highlights entity centrality in LLM reasoning, from claim-related information and multi-hop reasoning to representation errors causing erroneous evaluation. The global analysis, in turn, reveals trends in the underlying evolution, such as word-based knowledge evolving into claim-related facts. By interpreting semantics from LLM latent representations and enabling graph-related analyses, this work enhances the understanding of the factual knowledge resolution process.
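The core mechanic of activation patching, overwriting one token's hidden vector partway through a forward pass, can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the authors' pipeline: the two-dimensional "layers", the scalar weights, and the functions `layer_forward` and `run` are all hypothetical, and a real setup would instead hook into the layers of a transformer.

```python
import math


def layer_forward(hidden, weight):
    """Toy layer: mix each token vector with the sequence mean, then squash."""
    dim = len(hidden[0])
    mean = [sum(tok[d] for tok in hidden) / len(hidden) for d in range(dim)]
    return [
        [math.tanh(weight * tok[d] + 0.5 * mean[d]) for d in range(dim)]
        for tok in hidden
    ]


def run(hidden, weights, patch=None):
    """Run all layers, optionally patching one token vector.

    patch = (layer_index, position, vector) overwrites hidden[position]
    with `vector` at the input of layer `layer_index`.
    Returns a list of states: the input, then the output of each layer,
    so states[i] is the input to layer i.
    """
    states = [hidden]
    for i, w in enumerate(weights):
        if patch is not None and patch[0] == i:
            hidden = [list(tok) for tok in hidden]  # copy, keep history intact
            hidden[patch[1]] = list(patch[2])
        hidden = layer_forward(hidden, w)
        states.append(hidden)
    return states


# Hypothetical inputs: two tokens with two-dimensional representations.
weights = [0.9, 1.1, 0.7]
source = [[1.0, -0.5], [0.2, 0.3]]
target = [[0.0, 0.0], [0.1, -0.1]]

source_states = run(source, weights)
clean_states = run(target, weights)

# Take token 0's representation at the input of layer 1 in the source run
# and patch it into position 1 of the target run at the same layer.
vector = source_states[1][0]
patched_states = run(target, weights, patch=(1, 1, vector))
```

Comparing `patched_states` with `clean_states` shows that everything before the patched layer is unchanged while later states differ, which is what lets the patched run expose information carried by the transplanted vector.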
