Language Models are Bounded Pragmatic Speakers: Understanding RLHF from a Bayesian Cognitive Modeling Perspective
Khanh Nguyen
TL;DR
This paper proposes the bounded pragmatic speaker (BPS), a Bayesian cognitive model for analyzing large language models (LLMs) and their alignment via reinforcement learning from human feedback (RLHF). It shows that LLMs can be viewed as modular BPS instances, with a base speaker and a theory-of-mind listener derived from the model itself, and frames RLHF as variational inference within this architecture. The paper argues that RLHF captures only a rudimentary slow-thinking system and highlights limitations in counterfactual and long-term reasoning, advocating world models and richer feedback to enable better knowledge transfer to fast-thinking components. It outlines directions toward a dual model of thought, including advanced world modeling, richer communication, and more efficient inference algorithms, aiming to bridge cognitive science and reinforcement learning for more capable, interpretable AI. The work emphasizes the potential of Bayesian cognitive modeling to guide the development and interpretation of future LLMs and RLHF-based systems, with practical implications for safety, control, and scalability.
Abstract
How do language models "think"? This paper formulates a probabilistic cognitive model called the bounded pragmatic speaker, which can characterize the operation of different variations of language models. Specifically, we demonstrate that large language models fine-tuned with reinforcement learning from human feedback (Ouyang et al., 2022) embody a model of thought that conceptually resembles a fast-and-slow model (Kahneman, 2011), which psychologists have attributed to humans. We discuss the limitations of reinforcement learning from human feedback as a fast-and-slow model of thought and propose avenues for expanding this framework. In essence, our research highlights the value of adopting a cognitive probabilistic modeling approach to gain insights into the comprehension, evaluation, and advancement of language models.
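To make the speaker/listener framing concrete, the following is a minimal sketch on a toy set of three utterances, not the paper's implementation. It assumes that a pragmatic speaker reweights a base speaker's probabilities by a listener's score for each utterance, and that the reward used in fine-tuning equals the log of that listener score with a KL coefficient of 1. Under those assumptions, the standard closed-form optimum of a KL-regularized objective coincides with the reweighted (posterior) distribution. All names (pragmatic_posterior, kl_regularized_optimum, base, listener) and all numbers are hypothetical.

```python
import math


def pragmatic_posterior(base, listener):
    """Reweight base speaker probabilities by listener scores, then renormalize."""
    unnorm = {u: base[u] * listener[u] for u in base}
    total = sum(unnorm.values())
    return {u: p / total for u, p in unnorm.items()}


def kl_regularized_optimum(base, reward, beta):
    """Closed-form maximizer of E_q[reward] - beta * KL(q || base) on a finite set.

    The maximizer is proportional to base(u) * exp(reward(u) / beta).
    """
    unnorm = {u: base[u] * math.exp(reward[u] / beta) for u in base}
    total = sum(unnorm.values())
    return {u: p / total for u, p in unnorm.items()}


# Hypothetical toy distributions over three candidate utterances.
base = {"a": 0.5, "b": 0.3, "c": 0.2}      # base speaker (prior over utterances)
listener = {"a": 0.1, "b": 0.6, "c": 0.3}  # listener's score for the intended meaning

# Assumption for illustration: reward is the log listener score, beta is 1.
reward = {u: math.log(p) for u, p in listener.items()}

posterior = pragmatic_posterior(base, listener)
rl_policy = kl_regularized_optimum(base, reward, beta=1.0)

for u in base:
    print(u, round(posterior[u], 4), round(rl_policy[u], 4))
```

In this toy setting the two distributions agree for every utterance, which is one way to read the claim that RLHF-style fine-tuning performs approximate inference toward a pragmatic speaker. With a different KL coefficient or a learned reward model, the match would only be approximate.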
