Who is More Bayesian: Humans or ChatGPT?
Tianshi Mu, Pranjal Rawat, John Rust, Chengjun Zhang, Qixuan Zhong
TL;DR
The study benchmarks Bayesian rationality in humans and AI by reanalyzing classic urn-ball experiments and replicating them with ChatGPT variants. It develops a structural logit framework to infer subjective beliefs and unobserved heterogeneity, revealing that humans are heterogeneous yet highly efficient on average, while AI subjects rapidly close the gap across model versions and eventually surpass human Bayesian performance (notably with GPT-4o). The work highlights how priors, data, and calculational noise shape decision rules, and shows that advanced AI can achieve near-perfect Bayesian classifications, albeit with distinct error patterns and context effects. Together, these findings illuminate the evolving landscape of Bayesian rationality in both biological and artificial decision-makers and offer a principled methodology for cross-domain comparisons, with practical implications for diagnosis, risk assessment, and automated reasoning.
Abstract
We compare the performance of human and artificially intelligent (AI) decision makers in simple binary classification tasks where the optimal decision rule is given by Bayes Rule. We reanalyze choices of human subjects gathered from laboratory experiments conducted by El-Gamal and Grether and by Holt and Smith. We confirm that while Bayes Rule is overall the single best model for predicting human choices, subjects are heterogeneous and a significant share of them make suboptimal choices that reflect judgement biases described by Kahneman and Tversky, including the "representativeness heuristic" (excessive weight on the evidence from the sample relative to the prior) and "conservatism" (excessive weight on the prior relative to the sample). We compare these results with the performance of AI subjects gathered from recent versions of large language models (LLMs), including several versions of ChatGPT. These general-purpose generative AI chatbots are not specifically trained to do well in narrow decision making tasks, but are trained instead as "language predictors" using a large corpus of textual data from the web. We show that ChatGPT is also subject to biases that result in suboptimal decisions. However, we document a rapid evolution in the performance of ChatGPT from sub-human performance for early versions (ChatGPT 3.5) to superhuman and nearly perfect Bayesian classifications in the latest versions (ChatGPT 4o).
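
To make the classification task and the two biases concrete, the following is a minimal sketch of a two-urn problem. The urn compositions, sample size, priors, and the weights `w_prior` and `w_data` are illustrative assumptions and are not taken from the experiments; the weighted log-odds rule is a simple hypothetical parameterization of the biases described above, not the structural logit model developed in the paper. With both weights equal to 1 the rule reduces to Bayes Rule.

```python
import math

def posterior_a(prior_a, k, n, p_a, p_b):
    """Bayes posterior that the sample came from urn A, given k marked
    balls in n draws with replacement (binomial likelihoods)."""
    like_a = math.comb(n, k) * p_a**k * (1 - p_a)**(n - k)
    like_b = math.comb(n, k) * p_b**k * (1 - p_b)**(n - k)
    return prior_a * like_a / (prior_a * like_a + (1 - prior_a) * like_b)

def weighted_log_odds(prior_a, k, n, p_a, p_b, w_prior=1.0, w_data=1.0):
    """Log posterior odds of A versus B with separate (hypothetical) weights
    on the prior and on the sample evidence. w_prior = w_data = 1 is Bayes."""
    log_prior_odds = math.log(prior_a / (1 - prior_a))
    log_likelihood_ratio = (k * math.log(p_a / p_b)
                            + (n - k) * math.log((1 - p_a) / (1 - p_b)))
    return w_prior * log_prior_odds + w_data * log_likelihood_ratio

def classify(prior_a, k, n, p_a, p_b, w_prior=1.0, w_data=1.0):
    """Choose the urn with the larger (possibly biased) posterior odds."""
    odds = weighted_log_odds(prior_a, k, n, p_a, p_b, w_prior, w_data)
    return "A" if odds > 0 else "B"

# Assumed setup: urn A has 2/3 marked balls, urn B has 1/2, six draws.
P_A, P_B, N = 2 / 3, 1 / 2, 6

# Prior favors B, sample mildly favors A.
bayes_choice = classify(1 / 3, 4, N, P_A, P_B)
# Representativeness: the prior is underweighted relative to the sample.
representative_choice = classify(1 / 3, 4, N, P_A, P_B, w_prior=0.3)

# Prior favors A, sample favors B.
bayes_choice_2 = classify(2 / 3, 2, N, P_A, P_B)
# Conservatism: the sample is underweighted relative to the prior.
conservative_choice = classify(2 / 3, 2, N, P_A, P_B, w_data=0.5)

print(bayes_choice, representative_choice, bayes_choice_2, conservative_choice)
```

In this sketch a subject errs only when the distorted weights flip the sign of the log posterior odds, so a biased subject can still agree with Bayes Rule in many trials.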
