
Are language models rational? The case of coherence norms and belief revision

Thomas Hofweber, Peter Hase, Elias Stengel-Eskin, Mohit Bansal

TL;DR

This paper asks whether norms of rationality, specifically coherence norms and norms of belief revision, apply to language models. It develops a framework for treating LMs as having internal belief-like states and introduces the Minimal Assent Connection (MAC) to derive a credence cr(p) in a proposition p from next-token probabilities: cr(p) = as(p) / (as(p) + ds(p)), with a practical Yes/No approximation. It argues that synchronic coherence norms (logical coherence and probabilistic coherence of credences) plausibly apply only to models fine-tuned toward truth-seeking (e.g., via RLHF or truth-curated training), while purely pretrained LMs remain arational. For belief revision, the paper discusses Bayesian-style updating (including Jeffrey conditioning) as a diachronic norm, which would require related beliefs to be adjusted coherently after an external edit; what counts as genuine perceptual evidence for a language model remains an open challenge. Overall, it offers a principled route to predicting and explaining LM behavior through representational states, identifies when rational norms are relevant and how conformity to them can be measured, and draws out implications for AI safety and alignment.
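The credence formula and the two kinds of norm mentioned above can be illustrated with a minimal Python sketch. It rests on assumptions not stated in this summary: as(p) and ds(p) are read as the next-token probability mass on assenting and dissenting answers (approximated here by the single tokens "Yes" and "No"), and the function names and example probabilities are hypothetical. It is not the authors' implementation.

```python
def credence(p_assent: float, p_dissent: float) -> float:
    """cr(p) = as(p) / (as(p) + ds(p)).

    Assumption: p_assent and p_dissent are the next-token probabilities
    of an assenting token (e.g. "Yes") and a dissenting token (e.g. "No")
    when the model is asked whether p holds.
    """
    total = p_assent + p_dissent
    if total <= 0.0:
        raise ValueError("assent and dissent probabilities sum to zero")
    return p_assent / total


def coherence_gap(cr_p: float, cr_not_p: float) -> float:
    """Synchronic check: probabilistic coherence requires
    cr(p) + cr(not p) = 1, so the gap should be 0."""
    return abs(cr_p + cr_not_p - 1.0)


def jeffrey_update(cr_a_given_e: float,
                   cr_a_given_not_e: float,
                   new_cr_e: float) -> float:
    """Diachronic norm (Jeffrey conditioning): after the credence in E
    shifts to new_cr_e, the credence in A should become
    cr(A|E) * new_cr_e + cr(A|not E) * (1 - new_cr_e)."""
    return cr_a_given_e * new_cr_e + cr_a_given_not_e * (1.0 - new_cr_e)


if __name__ == "__main__":
    # Hypothetical next-token probabilities for the question "Is p true?"
    # (remaining mass is on tokens that are neither assent nor dissent).
    cr_p = credence(p_assent=0.6, p_dissent=0.2)
    print(f"cr(p) = {cr_p:.2f}")

    # Hypothetical credence in the negation, obtained the same way.
    cr_not_p = credence(p_assent=0.35, p_dissent=0.65)
    print(f"coherence gap = {coherence_gap(cr_p, cr_not_p):.2f}")

    # If an edit moves cr(E) to 0.5, a related belief A should follow.
    print(f"updated cr(A) = {jeffrey_update(0.9, 0.2, 0.5):.2f}")
```

Normalizing by as(p) + ds(p) discards probability mass on tokens that express neither assent nor dissent, so the credence depends only on the ratio between the two.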

Abstract

Do norms of rationality apply to machine learning models, in particular language models? In this paper we investigate this question by focusing on a special subset of rational norms: coherence norms. We consider both logical coherence norms and coherence norms tied to the strength of belief. To make sense of the latter, we introduce the Minimal Assent Connection (MAC) and propose a new account of credence, which captures the strength of belief in language models. This proposal uniformly assigns strength of belief simply on the basis of model-internal next-token probabilities. We argue that rational norms tied to coherence do apply to some language models, but not to others. This issue is significant since rationality is closely tied to predicting and explaining behavior, and thus it is connected to considerations about AI safety and alignment, as well as understanding model behavior more generally.
