Humans and LLMs Diverge on Probabilistic Inferences

Gaurav Kamath, Sreenath Madathil, Sebastian Schuster, Marie-Catherine de Marneffe, Siva Reddy

TL;DR

This paper introduces ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, and shows that human responses to these inferences are graded and varied, reflecting probabilistic judgments.

Abstract

Human reasoning often involves working over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25-30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.

Paper Structure (45 sections, 19 figures, 5 tables)

Figures (19)

  • Figure 1: High-level overview of our paper. We use ProbCOPA, a novel dataset of probabilistic inferences, to collect judgments of inference likelihood from humans and models, and study how well their respective judgment distributions align with one another.
  • Figure 2: Distribution of human responses to ProbCOPA. Top-left: likelihood scores across the entire dataset are tri-modal, with a significant proportion of responses between these modes; Top-right: likelihood scores for individual items typically follow a truncated normal distribution; Bottom-left: items with median responses towards the extreme ends of the scale show lower inter-annotator disagreement than those in the middle ranges; Bottom-right: higher inter-annotator disagreement on an item is (weakly) correlated with longer response times from participants.
  • Figure 3: Distribution of likelihood scores across all ProbCOPA items, from three models. In contrast to humans (see \ref{fig:human-response-dist-joint}), models rarely return responses indicating medium likelihood, though this tendency is less extreme with GPT-5. See \ref{fig:model-response-distributions-all} for the full set of distributions by model.
  • Figure 4: Item-wise comparisons between Gemini-3 and humans. Top-left: median likelihood scores from Gemini-3 align with those from humans at the extreme ends of the scale, but not in the middle ranges; Bottom-left: likelihood score distributions from Gemini-3 and humans reflect the same pattern, with the highest divergences for middle-range items (which also saw less inter-annotator agreement); Top-right: Gemini-3 shows less response diversity than humans for all items; Bottom-right: Gemini-3 on average reasons longer for items that humans disagree more on.
  • Figure 5: Distribution of item-wise Wasserstein distances between human and model likelihood score distributions. Ensembling the outputs of all models yields better distributional alignment with human judgments, but still falls short of the human-human baseline.
  • ...and 14 more figures
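
The item-wise Wasserstein distance mentioned in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' code: the function name `wasserstein_1d`, the 0-100 score scale, and the example scores are all assumptions made for illustration. It computes the empirical 1-Wasserstein distance between two one-dimensional samples by integrating the absolute difference between their empirical CDFs.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two 1-D samples.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and
    F_y are the empirical CDFs of the two samples.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    # The CDFs only change at observed values, so integrate piecewise
    # between consecutive distinct points.
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


# Hypothetical likelihood scores for a single item (assumed 0-100 scale).
human_scores = [10, 40, 50, 60, 90]   # graded, spread-out responses
model_scores = [0, 0, 50, 100, 100]   # responses piled at the extremes

distance = wasserstein_1d(human_scores, model_scores)
print(f"Wasserstein distance for this item: {distance:.2f}")
```

A distance of zero means the two score distributions are identical; larger values mean more probability mass has to be moved further along the scale to turn one distribution into the other. Repeating this per item gives the kind of distance distribution the caption describes.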