Standards for Belief Representations in LLMs
Daniel A. Herrmann, Benjamin A. Levinstein
TL;DR
The paper addresses the lack of a unified theoretical foundation for studying belief representations in LLMs and proposes four adequacy criteria (accuracy, coherence, uniformity, and use) grounded in decision theory and formal epistemology. It advocates using internal probes to uncover belief-like representations, discusses potential diachronic considerations, and ties belief measurement to current machine learning practice. The main contribution is a set of adequacy conditions for identifying and validating belief-like internal states, along with an explicit discussion of their limitations and of the potential safety and interpretability benefits. If such representations exist and satisfy the criteria, they could improve explainability, trust, and safety in LLM deployment by enabling robust honesty checks and cross-domain belief monitoring.
Abstract
As large language models (LLMs) continue to demonstrate remarkable abilities across various domains, computer scientists are developing methods to understand their cognitive processes, particularly concerning how (and whether) LLMs internally represent their beliefs about the world. However, this field currently lacks a unified theoretical foundation to underpin the study of belief in LLMs. This article begins to fill this gap by proposing adequacy conditions for a representation in an LLM to count as belief-like. We argue that, while the project of belief measurement in LLMs shares striking features with belief measurement as carried out in decision theory and formal epistemology, it also differs in ways that should change how we measure belief. Thus, drawing on insights from philosophy and contemporary machine learning practice, we establish four criteria that balance theoretical considerations with practical constraints. Our proposed criteria are accuracy, coherence, uniformity, and use, which together help lay the groundwork for a comprehensive understanding of belief representation in LLMs. We draw on empirical work showing the limitations of using various criteria in isolation to identify belief representations.
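To make the idea of probing for coherence more concrete, here is a minimal sketch that is not taken from the paper. It assumes one possible reading of the coherence criterion: a probe's credence in a statement and its credence in that statement's negation should sum to roughly 1. The linear-probe form, the function names (`probe_credence`, `negation_incoherence`), and the toy activations and weights are all hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Standard logistic function, mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def probe_credence(activation, weights, bias=0.0):
    """Hypothetical linear probe: map a hidden-state vector to a credence.

    Assumption: the probe is a logistic regression over the activation.
    """
    score = sum(a * w for a, w in zip(activation, weights)) + bias
    return sigmoid(score)


def negation_incoherence(p_statement: float, p_negation: float) -> float:
    """Absolute deviation from the constraint P(x) + P(not x) = 1."""
    return abs(p_statement + p_negation - 1.0)


# Toy, made-up activations for a statement and its negation.
weights = [1.0, -1.0]
act_statement = [2.0, 0.0]
act_negation = [0.0, 2.0]

p_x = probe_credence(act_statement, weights)
p_not_x = probe_credence(act_negation, weights)
gap = negation_incoherence(p_x, p_not_x)
```

In this toy setup the two scores are exact negatives of each other, so the gap is zero up to floating-point error. With a real probe, a large gap across many statement pairs would be one sign that the probed representation fails this reading of coherence, though passing the check alone would not show that it satisfies the other three criteria.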
