Humans and LLMs Diverge on Probabilistic Inferences

Gaurav Kamath; Sreenath Madathil; Sebastian Schuster; Marie-Catherine de Marneffe; Siva Reddy

TL;DR

This paper introduces ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, and shows that human responses to these inferences are graded and varied, reflecting probabilistic rather than all-or-nothing judgments.

Abstract

Human reasoning often involves working over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25–30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.

Paper Structure (45 sections, 19 figures, 5 tables)

Introduction
The ProbCOPA Dataset
Data Construction
Human Annotation Procedure
Reproducibility of Human Responses
Analysis of Human Responses
Methodology
On Normalizing Human Responses
Metric for Response Spread
Results
Likelihood scores from humans reveal graded, probabilistic judgments.
Human likelihood score distributions are almost always unimodal.
Annotators do not collectively agree on a hypothesis having medium likelihood.
Higher entropy items are (weakly) correlated with longer human response times.
Comparison with Responses from Reasoning LLMs
...and 30 more sections

Figures (19)

Figure 1: High-level overview of our paper. We use ProbCOPA, a novel dataset of probabilistic inferences, to collect judgments of inference likelihood from humans and models, and study how well their respective judgment distributions align with one another.
Figure 2: Distribution of human responses to ProbCOPA. Top-left: likelihood scores across the entire dataset are tri-modal, with a significant proportion of responses between these modes; Top-right: likelihood scores for individual items typically follow a truncated normal distribution; Bottom-left: items with median responses towards the extreme ends of the scale are subject to lower inter-annotator disagreement than those in the middle ranges; Bottom-right: items with higher inter-annotator disagreement are (weakly) correlated with longer response times from participants.
Figure 3: Distribution of likelihood scores across all ProbCOPA items, from three models. In contrast to humans (see \ref{fig:human-response-dist-joint}), models rarely return responses indicating medium likelihood, though this tendency is less extreme with GPT-5. See \ref{fig:model-response-distributions-all} for the full set of distributions by model.
Figure 4: Item-wise comparisons between Gemini-3 and humans. Top-left: median likelihood scores from Gemini-3 align with those from humans at the extreme ends of the scale, but not in the middle ranges; Bottom-left: likelihood score distributions from Gemini-3 and humans reflect the same pattern, with the highest divergences for middle-range items (which also saw less inter-annotator agreement); Top-right: Gemini-3 shows less response diversity than humans for all items; Bottom-right: Gemini-3 on average reasons longer for items that humans disagree more on.
Figure 5: Distribution of item-wise Wasserstein distances between human and model likelihood score distributions. Ensembling the outputs of all models yields better distributional alignment with human judgments, but still falls short of the human-human baseline.
...and 14 more figures
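
The item-wise Wasserstein distance mentioned in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' code: the function name `wasserstein_1d`, the example scores, and the assumed 0 to 100 rating scale are all hypothetical, and the paper may compute the distance differently (for example, on normalized responses). The sketch computes the Wasserstein-1 distance between two empirical one-dimensional distributions as the area between their cumulative distribution functions.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two empirical 1-D samples.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and
    F_y are the empirical CDFs of the two samples.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    # The CDFs only change at observed values, so integrate piecewise
    # between consecutive distinct values.
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


# Hypothetical likelihood scores for a single item (assumed 0-100 scale).
human_scores = [20, 40, 50, 60, 80]
model_scores = [90, 95, 100, 100, 100]
distance = wasserstein_1d(human_scores, model_scores)
print(f"Wasserstein distance for this item: {distance:.1f}")
```

Under these assumptions, a larger value indicates that the model's score distribution for an item sits further from the human one, and identical samples give a distance of zero.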
