Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making
Khurram Yamin, Jingjing Tang, Santiago Cortes-Gomez, Amit Sharma, Eric Horvitz, Bryan Wilder
TL;DR
This work develops a decision-theoretic framework to assess whether elicited probabilities from LLMs reflect subjective beliefs that drive actions under uncertainty. By deriving falsifiable conditions from random utility and prospect theory models, it links beliefs to choices via tests such as the conditional-independence condition $I(A;\theta\mid p)=0$ and monotone pairwise probabilities, and applies these to four medical-diagnosis tasks across multiple LLMs. Across real-world and expert-network data, the study finds systematic but imperfect alignment between elicited beliefs and decisions, with robust evidence of belief insufficiency and model-specific deviations. The results highlight how decision-oriented evaluation can diagnose and guide improvements in high-stakes LLM decision support, while acknowledging that such tests can falsify rationality but cannot prove it. Overall, the paper provides a practical, falsification-based toolkit for diagnosing decision-relevant beliefs in LLMs, with implications for safety and reliability in medical contexts.
Abstract
Large language models (LLMs) are increasingly deployed as agents in high-stakes domains where optimal actions depend on both uncertainty about the world and the utilities of different outcomes, yet their decision logic remains difficult to interpret. We study whether LLMs are rational utility maximizers with coherent beliefs and stable preferences. We examine model behavior on diagnostic challenge problems. The results offer insight into how the models' elicited probabilities and observed actions relate to ideal Bayesian utility maximization. Our approach provides falsifiable conditions under which the reported probabilities cannot correspond to the true beliefs of any rational agent. We apply this methodology to multiple medical diagnostic domains with evaluations across several LLMs. We discuss implications of the results and future directions for using LLMs to guide high-stakes decisions.
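
To make the condition $I(A;\theta\mid p)=0$ from the TL;DR concrete, the following is a minimal sketch of a plug-in estimate of conditional mutual information from discrete samples. It is not the paper's estimator. It assumes, for illustration only, that $A$ is the action the model chose, $\theta$ is the true underlying state, and $p$ is the elicited probability after discretization into bins; the function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def conditional_mutual_information(actions, states, belief_bins):
    """Plug-in estimate of I(A; theta | p) in nats.

    Assumed (hypothetical) inputs, one entry per case:
      actions     -- the action the model chose (A)
      states      -- the true underlying state (theta)
      belief_bins -- the elicited probability, discretized into a bin label (p)
    """
    n = len(actions)
    joint = Counter(zip(actions, states, belief_bins))   # counts of (a, theta, p)
    action_bin = Counter(zip(actions, belief_bins))      # counts of (a, p)
    state_bin = Counter(zip(states, belief_bins))        # counts of (theta, p)
    bin_only = Counter(belief_bins)                      # counts of p

    cmi = 0.0
    for (a, t, b), count in joint.items():
        # Ratio p(a, theta | p) / (p(a | p) * p(theta | p)), written with raw counts.
        ratio = count * bin_only[b] / (action_bin[(a, b)] * state_bin[(t, b)])
        cmi += (count / n) * math.log(ratio)
    return cmi


# Toy case 1: the action depends only on the belief bin, so the estimate is 0.
sufficient = conditional_mutual_information(
    actions=[0, 0, 1, 1], states=[0, 1, 0, 1], belief_bins=[0, 0, 1, 1]
)

# Toy case 2: within a single belief bin the action tracks the true state,
# so the estimate is log(2) nats and the reported belief is not sufficient.
insufficient = conditional_mutual_information(
    actions=[0, 0, 1, 1], states=[0, 0, 1, 1], belief_bins=[0, 0, 0, 0]
)

print(round(sufficient, 3), round(insufficient, 3))  # 0.0 0.693
```

A plug-in estimate of this kind is biased upward on finite samples, so a positive value alone would not count as evidence against belief sufficiency; in practice one would compare it against a null distribution, for example by permuting actions within each belief bin.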
