Standards for Belief Representations in LLMs

Daniel A. Herrmann, Benjamin A. Levinstein

TL;DR

The paper addresses the lack of a unified theoretical foundation for studying belief representations in LLMs. It proposes four adequacy criteria (accuracy, coherence, uniformity, and use), grounded in decision theory and formal epistemology, that an internal representation should satisfy to count as belief-like. It advocates using internal probes to uncover such representations, discusses potential diachronic considerations, and situates belief measurement within a practice-informed framework. The main contribution is a coherent framework for identifying and validating belief-like internal states, together with an explicit discussion of its limitations and of the potential benefits for safety and interpretability. If such representations exist and satisfy the criteria, they could improve explainability, trust, and safety in LLM deployment by enabling robust honesty checks and cross-domain belief monitoring.

Abstract

As large language models (LLMs) continue to demonstrate remarkable abilities across various domains, computer scientists are developing methods to understand their cognitive processes, particularly concerning how (and if) LLMs internally represent their beliefs about the world. However, this field currently lacks a unified theoretical foundation to underpin the study of belief in LLMs. This article begins filling this gap by proposing adequacy conditions for a representation in an LLM to count as belief-like. We argue that, while the project of belief measurement in LLMs shares striking features with belief measurement as carried out in decision theory and formal epistemology, it also differs in ways that should change how we measure belief. Thus, drawing from insights in philosophy and contemporary practices of machine learning, we establish four criteria that balance theoretical considerations with practical constraints. Our proposed criteria include accuracy, coherence, uniformity, and use, which together help lay the groundwork for a comprehensive understanding of belief representation in LLMs. We draw on empirical work showing the limitations of using various criteria in isolation to identify belief representations.

Paper Structure (13 sections, 6 figures)

This paper contains 13 sections and 6 figures.

Figures (6)

  • Figure 1: An illustration of an LLM on the left, and a probe on the right. A sentence is fed through the model. Some internal computation (such as an embedding vector) is extracted and input into the probe, which decodes it to recover the model's belief about the sentence.
  • Figure 2: In this toy example, the dots correspond to internal activations input into a probe for different prompts. Blue dots represent true claims, and red dots represent false claims. In this image, truth and falsity appear well-distinguished internally along the black arrow. Because this is merely a toy example with axes corresponding to hypothetical dimensions in activation space, we do not label the axes or assume any type of scale.
  • Figure 3: As before, in the toy example, blue dots represent some internal activations corresponding to true claims, and red dots represent false claims. On the left, truth and falsity are well-separated by the relevant activations, and the probe should be able to detect such a separation to achieve high accuracy. In the middle, the probe should achieve medium accuracy, and on the right, there is virtually no separation, so the probe should achieve only low accuracy.
  • Figure 4: In the toy example here, we stipulate the orange dots correspond to activations for "A is to the left of B" and "B is to the right of A"; the purple dots correspond to "Paris is in France, and Toronto is in Canada," and "Toronto is in Canada, and Paris is in France"; and the green dot corresponds to "A is not to the left of B". The black arrow corresponds to the direction of truth. If the representations found are coherent, then the purple dots should be close together along the direction of truth. The orange dots should also be close together and far from the green dot along the direction of truth. The plot on the left, then, illustrates a fairly coherent pattern of activations, whereas the plot on the right does not.
  • Figure 5: In the toy example, blue and red represent true and false claims (respectively) for sentences about one domain, for example, sentences about cities, while green and orange represent true and false claims (respectively) for sentences about a different domain, for example, sentences about plants. On the left, there is a consistent direction of truth in the model's representation space for both domains, suggesting high uniformity. On the right, the directions of truth are almost orthogonal, suggesting low uniformity.
  • ...and 1 more figure
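
The captions for Figures 2, 3, and 5 describe a "direction of truth" in activation space, probe accuracy, and uniformity across domains. The sketch below illustrates those three ideas on synthetic data only. It is not the authors' method: the difference-of-means probe is one simple choice among many, and every name in it (make_domain, fit_direction, accuracy, cosine), the two-dimensional "activations", and the separation and noise values are assumptions made for illustration. No real model is probed.

```python
import math
import random


def make_domain(rng, truth_dir, n=200, separation=2.0, noise=1.0):
    """Synthetic stand-in for activations: true claims are shifted along
    +truth_dir, false claims along -truth_dir, plus Gaussian noise."""
    data = []
    for i in range(n):
        label = i % 2  # 1 = true claim, 0 = false claim
        sign = 1.0 if label else -1.0
        x = [sign * separation * d + rng.gauss(0.0, noise) for d in truth_dir]
        data.append((x, label))
    return data


def mean_vec(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_direction(data):
    """Difference-of-means probe: unit vector from the mean of the false
    activations to the mean of the true ones, plus the midpoint."""
    mu_true = mean_vec([x for x, y in data if y == 1])
    mu_false = mean_vec([x for x, y in data if y == 0])
    diff = [a - b for a, b in zip(mu_true, mu_false)]
    norm = math.sqrt(sum(d * d for d in diff))
    direction = [d / norm for d in diff]
    midpoint = [(a + b) / 2 for a, b in zip(mu_true, mu_false)]
    return direction, midpoint


def accuracy(data, direction, midpoint):
    """Fraction of claims whose projection onto the direction falls on the
    correct side of the midpoint."""
    correct = 0
    for x, y in data:
        score = sum((xi - mi) * di for xi, mi, di in zip(x, midpoint, direction))
        correct += int((score > 0) == (y == 1))
    return correct / len(data)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


rng = random.Random(0)
cities = make_domain(rng, [1.0, 0.0])             # domain 1
plants_aligned = make_domain(rng, [1.0, 0.0])     # same truth direction
plants_orthogonal = make_domain(rng, [0.0, 1.0])  # orthogonal truth direction

dir_c, mid_c = fit_direction(cities)
dir_pa, _ = fit_direction(plants_aligned)
dir_po, _ = fit_direction(plants_orthogonal)

# Accuracy: scored on the fitting data for brevity; a real evaluation
# would use held-out prompts.
acc_in_domain = accuracy(cities, dir_c, mid_c)

# Uniformity: does the direction found for one domain carry over to another?
acc_transfer_aligned = accuracy(plants_aligned, dir_c, mid_c)
acc_transfer_orthogonal = accuracy(plants_orthogonal, dir_c, mid_c)
uniformity_high = cosine(dir_c, dir_pa)
uniformity_low = cosine(dir_c, dir_po)

print(acc_in_domain, acc_transfer_aligned, acc_transfer_orthogonal)
print(uniformity_high, uniformity_low)
```

In this toy setup, a probe fitted on the "cities" domain should transfer well when the second domain shares its truth direction and fall to roughly chance when the directions are orthogonal, which mirrors the left and right panels described for Figure 5. Cosine similarity between the fitted directions is only one possible way to quantify that contrast.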