Incoherent Probability Judgments in Large Language Models

Jian-Qiao Zhu; Thomas L. Griffiths

TL;DR

The paper investigates whether autoregressive LLMs produce coherent probability judgments by applying probabilistic identities and eliciting repeated judgments across 24 event pairs. Using prompts that constrain outputs to $[0,1]$ and analyzing four models at different temperatures, the study finds systematic, human-like violations of the identities and an inverted-U-shaped mean-variance relationship in repeated judgments. The authors show that these patterns align more closely with a Bayesian Sampler account than with additive noise models, and they connect the autoregressive objective to implicit Bayesian inference via de Finetti exchangeability. The work suggests that adjusting the degree of inferred incoherence, rather than purely calibrating to frequencies, could improve AI probability judgments. It also highlights a productive synergy between Bayesian theory and neural networks for understanding machine and human cognition.

Abstract

Autoregressive Large Language Models (LLMs) trained for next-word prediction have demonstrated remarkable proficiency at producing coherent text. But are they equally adept at forming coherent probability judgments? We use probabilistic identities and repeated judgments to assess the coherence of probability judgments made by LLMs. Our results show that the judgments produced by these models are often incoherent, displaying human-like systematic deviations from the rules of probability theory. Moreover, when prompted to judge the same event repeatedly, the mean-variance relationship of probability judgments produced by LLMs shows an inverted-U shape like that seen in humans. We propose that these deviations from rationality can be explained by linking autoregressive LLMs to implicit Bayesian inference and drawing parallels with the Bayesian Sampler model of human probability judgments.
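To make the idea of a probabilistic identity concrete, the following is a minimal sketch of how coherence could be checked for a single event pair. It uses two textbook identities (complement and addition) that must equal zero for any coherent set of judgments. The function names and the numeric judgments are hypothetical illustrations, and these are not necessarily the identities used in the paper.

```python
# Minimal sketch: two textbook probabilistic identities that equal zero
# whenever a set of probability judgments is coherent.
# All function names and input values are hypothetical illustrations.


def complement_identity(p_a: float, p_not_a: float) -> float:
    """Return P(A) + P(not A) - 1, which is zero for coherent judgments."""
    return p_a + p_not_a - 1.0


def addition_identity(p_a: float, p_b: float,
                      p_a_and_b: float, p_a_or_b: float) -> float:
    """Return P(A) + P(B) - P(A and B) - P(A or B), zero if coherent."""
    return p_a + p_b - p_a_and_b - p_a_or_b


# A coherent set of judgments: both identities evaluate to zero.
coherent_complement = complement_identity(0.25, 0.75)
coherent_addition = addition_identity(0.5, 0.25, 0.125, 0.625)

# An incoherent set (made-up numbers): the identities deviate from zero.
incoherent_complement = complement_identity(0.6, 0.5)
incoherent_addition = addition_identity(0.6, 0.3, 0.25, 0.7)

print(coherent_complement, coherent_addition)
print(incoherent_complement, incoherent_addition)
```

A nonzero value signals incoherence, and its sign indicates the direction of the deviation. Averaging such values over many event pairs gives the kind of summary that a coherence analysis would report.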

Paper Structure (11 sections, 7 equations, 3 figures, 1 table)

Introduction
Background
Probabilities in Large Language Models
Assessing Coherence via Probabilistic Identities
Repeated Judgments
Evaluating Coherence in LLMs
Methods
Results
Theories of Human Probability Judgments
Connecting LLMs to Bayesian Inference
Discussion

Figures (3)

Figure 1: Bias and variability in human probability judgments as revealed by (left) probabilistic identities and (right) mean-variance relationship. Error bars are 95% CI. Solid line represents the best-fitting linear regression. Data adapted from Zhu et al. (2020), Experiment 1.
Figure 2: Probabilistic identities based on LLM responses. For coherent judgments, all probabilistic identities should be zero. (A) GPT-3.5-turbo model. (B) GPT-4 model. (C) LLaMA-2 model with 7b parameters. (D) LLaMA-2 model with 70b parameters. Error bars represent 95% CI across the 24 event pairs.
Figure 3: The relationship between mean and variance in repeated probability judgments produced by LLMs exhibits an inverted-U shape. Solid lines represent the best-fitting regression models. Variants with more parameters or lower temperatures tend to shift the curve outward and downward, suggesting more consistent judgments. (A) GPT-3.5-turbo model. (B) GPT-4 model. (C) LLaMA-2 model with 7b parameters. (D) LLaMA-2 model with 70b parameters.
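
The inverted-U mean-variance relationship described in the Figure 3 caption can be illustrated with a small sketch of the Bayesian Sampler account. The sketch assumes the commonly stated form of that model, in which a judgment is $(S + \beta)/(N + 2\beta)$ with $S \sim \mathrm{Binomial}(N, p)$. The parameter values `n_samples=5` and `beta=1.0` are arbitrary choices for illustration and are not fitted values from the paper.

```python
# Minimal sketch of the mean and variance of judgments under an assumed
# Bayesian Sampler form: estimate = (S + beta) / (N + 2 * beta),
# where S ~ Binomial(N, p). Parameter values are illustrative only.


def bayesian_sampler_moments(p: float, n_samples: int = 5,
                             beta: float = 1.0) -> tuple[float, float]:
    """Return (mean, variance) of the judged probability for true p."""
    denom = n_samples + 2.0 * beta
    mean = (n_samples * p + beta) / denom
    # Var[S] = N * p * (1 - p) for a binomial count, scaled by 1 / denom^2.
    variance = n_samples * p * (1.0 - p) / denom ** 2
    return mean, variance


# Sweep the underlying probability and collect (mean, variance) pairs.
curve = [bayesian_sampler_moments(k / 10) for k in range(11)]

for mean, variance in curve:
    print(f"mean={mean:.3f}  variance={variance:.4f}")
```

Under these assumptions the variance is zero at the extremes and peaks when the mean judgment is 0.5, which traces an inverted U. The mean judgments are also pulled toward 0.5 (they range over roughly 0.14 to 0.86 here rather than 0 to 1), which is the regressive bias that the prior parameter `beta` introduces.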
