Who is More Bayesian: Humans or ChatGPT?

Tianshi Mu, Pranjal Rawat, John Rust, Chengjun Zhang, Qixuan Zhong

TL;DR

The study rigorously benchmarks Bayesian rationality in humans and AI by reanalyzing classic urn-ball experiments and replicating them with ChatGPT variants. It develops a structural logit framework to infer subjective beliefs and unobserved heterogeneity, revealing that humans are heterogeneous yet highly efficient on average, while AI subjects rapidly close the gap and eventually surpass human Bayesian performance (notably with GPT-4o). The work highlights how priors, data, and calculational noise shape decision rules, and shows that advanced AI can approximate or achieve near-perfect Bayes classifications, albeit with distinct error patterns and context effects. Together, these findings illuminate the evolving landscape of Bayesian rationality in both biological and artificial decision-makers and offer a principled methodology for cross-domain comparisons with practical implications for diagnosis, risk assessment, and automated reasoning.

Abstract

We compare the performance of human and artificially intelligent (AI) decision makers in simple binary classification tasks where the optimal decision rule is given by Bayes Rule. We reanalyze choices of human subjects gathered from laboratory experiments conducted by El-Gamal and Grether and by Holt and Smith. We confirm that while, overall, Bayes Rule represents the single best model for predicting human choices, subjects are heterogeneous and a significant share of them make suboptimal choices that reflect judgement biases described by Kahneman and Tversky, including the "representativeness heuristic" (excessive weight on the evidence from the sample relative to the prior) and "conservatism" (excessive weight on the prior relative to the sample). We compare the performance of AI subjects gathered from recent versions of large language models (LLMs), including several versions of ChatGPT. These general-purpose generative AI chatbots are not specifically trained to do well in narrow decision-making tasks, but are trained instead as "language predictors" using a large corpus of textual data from the web. We show that ChatGPT is also subject to biases that result in suboptimal decisions. However, we document a rapid evolution in the performance of ChatGPT from sub-human performance for early versions (ChatGPT 3.5) to superhuman and nearly perfect Bayesian classifications in the latest versions (ChatGPT 4o).

Paper Structure

This paper contains 29 sections, 2 theorems, 44 equations, 22 figures, 8 tables, 1 algorithm.

Key Result

Lemma L1

The optimal decision rule for a statistical experiment with a binomial design can be defined in terms of Bayes Rule by
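
The lemma's formula is not reproduced on this page. As a rough illustration only, the standard Bayes Rule calculation for a two-urn experiment with binomial sampling can be sketched as follows. The function names, the urn compositions (2/3 and 1/2), the sample size of 6, and the tie-breaking choice are assumptions made for this sketch, not values or notation taken from the paper.

```python
from math import comb


def posterior_urn_a(prior_a, p_a, p_b, n_draws, n_marked):
    """Posterior probability that the sample came from urn A.

    prior_a  : prior probability that urn A was selected
    p_a, p_b : probability of drawing a marked ball from urn A / urn B
    n_draws  : number of balls drawn with replacement
    n_marked : number of marked balls observed in the sample
    (All names are hypothetical, chosen for this sketch.)
    """
    # Binomial likelihood of the observed sample under each urn.
    like_a = comb(n_draws, n_marked) * p_a**n_marked * (1 - p_a) ** (n_draws - n_marked)
    like_b = comb(n_draws, n_marked) * p_b**n_marked * (1 - p_b) ** (n_draws - n_marked)
    # Bayes Rule: prior times likelihood, normalized over both urns.
    return prior_a * like_a / (prior_a * like_a + (1 - prior_a) * like_b)


def bayes_choice(prior_a, p_a, p_b, n_draws, n_marked):
    """Pick the urn with the higher posterior (ties go to A, an arbitrary choice)."""
    post_a = posterior_urn_a(prior_a, p_a, p_b, n_draws, n_marked)
    return "A" if post_a >= 0.5 else "B"


if __name__ == "__main__":
    # Illustrative parameters only: equal priors, 6 draws, 4 marked balls.
    print(posterior_urn_a(0.5, 2 / 3, 1 / 2, 6, 4))
    print(bayes_choice(0.5, 2 / 3, 1 / 2, 6, 4))
```

Under these assumed parameters, lowering the prior on urn A from 1/2 to 1/3 flips the choice for the same sample, which shows how the prior and the data jointly determine the classification.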

Figures (22)

  • Figure 1: Which group of subjects, 1 or 2, are GPT and which are human?
  • Figure 2: Example of weak identification of subjective posterior beliefs
  • Figure 3: Comparison of subject behavior and models in the California experiments
  • Figure 4: Inferred Posterior beliefs of California subjects from EC and FM algorithms
  • Figure 5: CCPs implied by EC and FM models of California subjects
  • ...and 17 more figures

Theorems & Definitions (6)

  • Definition D1
  • Definition D2
  • Definition D3
  • Definition D4
  • Lemma L1
  • Lemma L2