Reasoning aligns language models to human cognition

Gonçalo Guiomar, Elia Torre, Pehuen Moure, Victoria Shavina, Mario Giulianelli, Shih-Chii Liu, Valerio Mante

TL;DR

This work investigates how language models decide under uncertainty and the role of chain-of-thought reasoning in aligning with human cognition. By introducing an active probabilistic reasoning task that separates evidence sampling from inference and by fitting a four-parameter mechanistic model, the authors compare humans and a broad suite of LLMs against near-optimal policies. Extended reasoning chiefly boosts inference by reducing biases and sharpening belief-to-choice mappings, placing agents in a shared cognitive space but leaving a gap in active information acquisition. The study provides a principled framework for evaluating alignment that links observable behavior to interpretable latent computations and charts directions for improving sampling efficiency in LLMs.

Abstract

Do language models make decisions under uncertainty like humans do, and what role does chain-of-thought (CoT) reasoning play in the underlying decision process? We introduce an active probabilistic reasoning task that cleanly separates sampling (actively acquiring evidence) from inference (integrating evidence toward a decision). Benchmarking humans and a broad set of contemporary large language models against near-optimal reference policies reveals a consistent pattern: extended reasoning is the key determinant of strong performance, driving large gains in inference and producing belief trajectories that become strikingly human-like, while yielding only modest improvements in active sampling. To explain these differences, we fit a mechanistic model that captures systematic deviations from optimal behavior via four interpretable latent variables: memory, strategy, choice bias, and occlusion awareness. This model places humans and models in a shared low-dimensional cognitive space, reproduces behavioral signatures across agents, and shows how chain-of-thought shifts language models toward human-like regimes of evidence accumulation and belief-to-choice mapping, tightening alignment in inference while leaving a persistent gap in information acquisition.
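
The split between evidence accumulation and belief-to-choice mapping can be illustrated with a minimal Python sketch. It is not the authors' implementation. The RED probability of the biased button (`P_RED_BIASED = 0.9`) is a made-up value, since the abstract does not state it, and the power-law form of `belief_to_choice` is an assumed stand-in for an inverse-temperature mapping. All function names are hypothetical.

```python
# Illustrative sketch only: the parameter values and functional forms are assumptions.
P_RED_BIASED = 0.9    # assumption: the source does not give the bias strength
P_RED_UNBIASED = 0.5  # unbiased buttons show RED or GREEN equally often


def update_posterior(posterior, button, outcome):
    """One Bayes step. posterior[i] = P(button i is the biased one | evidence so far)."""
    updated = []
    for i, prior in enumerate(posterior):
        # Under hypothesis "i is biased", the pressed button is biased only if i == button.
        p_red = P_RED_BIASED if i == button else P_RED_UNBIASED
        likelihood = p_red if outcome == "RED" else 1.0 - p_red
        updated.append(prior * likelihood)
    total = sum(updated)
    return [u / total for u in updated]


def map_choice(posterior):
    """Maximum a posteriori decision: index of the most probable button."""
    return max(range(len(posterior)), key=posterior.__getitem__)


def belief_to_choice(posterior, kappa):
    """Assumed inverse-temperature mapping: pi_i proportional to posterior_i ** kappa.

    kappa = 0 gives a uniform (random) policy; large kappa approaches MAP.
    """
    weights = [p ** kappa for p in posterior]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    belief = [0.25, 0.25, 0.25, 0.25]  # uniform prior over buttons A-D
    for button, outcome in [(0, "RED"), (0, "RED"), (1, "GREEN")]:
        belief = update_posterior(belief, button, outcome)
    print("posterior:", belief)
    print("MAP choice index:", map_choice(belief))
    print("policy at kappa=2:", belief_to_choice(belief, 2.0))
```

In this sketch, sampling corresponds to choosing which `button` to pass to `update_posterior`, while inference corresponds to how the final belief is turned into a choice.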

Paper Structure (61 sections, 86 equations, 13 figures)

Figures (13)

  • Figure 1: From task performance to latent cognitive variables. A: Task. We introduce an active probabilistic reasoning task in which agents sequentially sample from up to four buttons (A–D), each revealing a binary outcome (RED/GREEN). One button is biased toward RED, while the others are unbiased. During $N$ sampling rounds, agents actively choose what evidence to sample among the buttons available on a given round. In a final inference round, agents indicate which button they believe is biased. An equivalent text-based version is used for LLMs (Appendix \ref{sec:prompts}). B: Behavior. We compare human and LLM behavior by quantifying overall performance, sampling/inference quality, and invalid choices. These metrics reveal a broad spectrum of performance, with extended chain-of-thought reasoning improving overall success via enhanced inference, while gains in sampling remain limited. C: Mechanisms. To move beyond behavioral scores, we fit a mechanistic model that captures deviations from optimal Bayesian inference using four interpretable latent variables: Memory ($\beta$), Strategy ($\kappa$), Choice Bias ($\omega$), and Occlusion Awareness ($\theta$). D: Cognitive space. These latent variables define a shared low-dimensional cognitive space in which humans and models can be positioned. Reasoning shifts LLMs toward human-like inference strategies, and tightens, but does not fully close, the gap in sampling strategies.
  • Figure 2: Comparing human and LLM behavior. A: Task performance. Average success rate across trial lengths $N\in\{2,\dots,15\}$. We report human performance (green), split into the lower $75\%$ and top $25\%$ of participants, and the near-optimal reference agent (PPO sampling + MAP inference) (light blue). For models that support increased reasoning effort, gray overlays indicate the Extended Reasoning condition. Error bars represent standard deviations, computed across trial-cluster means with a uniform distribution over the number of rounds. The vertical line marks chance performance ($25\%$). B: Sampling and inference quality. We quantify sampling quality (left) and inference quality (right) as performance loss with respect to the near-optimal agent (lower is better): inference loss is the gap between the agent and a counterfactual agent that preserves the same sampled evidence but applies MAP at the inference round for the final decision; sampling loss is the gap between this counterfactual MAP agent and the reference agent (PPO + MAP), isolating suboptimal evidence acquisition. Reasoning primarily reduces inference loss, with only modest effects on sampling loss. C: Invalid choices. Fraction of invalid choices during sampling (left) and at the final inference decision (right). Invalid choices include selecting occluded options, producing tokens outside the valid choice set (A–D), or failing to respond; humans cannot produce invalid choices in the graphical interface. Invalid choices occur more frequently during sampling than at the final decision and are reduced by reasoning.
  • Figure 3: Model parameters explain human and LLM behavior. A: Memory parameter $\beta$ governs non-optimal evidence integration, spanning stubborn ($\beta<0$) to forgetful ($\beta>0$) regimes. B: Strategy parameter $\kappa$ controls choice stochasticity (random near $\kappa=0$, increasingly rational for large $\kappa$), fit separately for sampling (left) and inference (right). C: Entropy of the choice-bias vector $\omega^x$ captures deviations from internal posterior-driven decisions, fit separately for sampling and inference (smaller entropy means more biased). D: Occlusion Awareness parameter (shown as $\log \theta$) captures sensitivity to occlusions (invalid choices). E: Model schematic: choices generate evidence that updates a Bayesian posterior $p_t$; memory ($\beta$) produces a non-optimal information accumulation process $h_t$; strategy ($\kappa$) applies an inverse-temperature transformation; bias ($\omega$) and awareness ($\theta$) modulate the resulting policy $\pi_t$.
  • Figure 4: Reasoning aligns language models to human cognition. A: Cognitive space. Agents are embedded by fitted memory $\beta$ (stubborn $\beta<0$ to forgetful $\beta>0$) and inference strategy $\kappa_f$ (more rational for larger $\kappa_f$). The near-optimal agent is not shown ($\kappa_f$ beyond y-axis limits). Shading/contours show predicted success across fitted model $(\beta,\kappa_f)$; markers show humans (green/light green) and LLMs under low/no reasoning (black border) versus extended reasoning (white border). Reasoning shifts models toward the high-success, human-like regime. B: Latent-variable dynamics. A specific evidence sequence (top row) and the resulting round-by-round dynamics of latent variables (lower rows) during a game by gpt oss 20b under high- (left) vs. low-reasoning effort (right). Game-dependent terms (log-likelihood increments $\Delta h_t$ and the posterior $p_t$) are identical across efforts, while agent-dependent quantities differ: low-reasoning yields a stubborn memory state trajectory $h_t$ (4th row; $\beta<0$) and a diffuse policy $\pi_t$ (5th row; smaller $\kappa_f$), whereas high-reasoning shows near-optimal memory updates ($\beta\!\approx\!0$) and a sharper, more decisive belief-to-choice mapping (larger $\kappa_f$).
  • Figure 5: Evolution of success rate across rounds. For each agent, we report the mean success rate (fraction of trials in which the final reported button is the true biased one) as a function of the number of sampling rounds $N\in\{2,\dots,15\}$. Humans are shown in green, split into the top $25\%$ and lower $75\%$ of participants (same split as in Fig. \ref{fig:bench}A); LLMs are shown as individual traces (legend), with colors indicating each model’s overall average success rate (lighter/orange = lower, darker/purple = higher). Across models, this view reveals a qualitative separation at $\sim 45\%$ average success: below this regime, curves remain approximately flat with increasing $N$, suggesting limited ability to convert additional evidence into improved final decisions; above it, curves exhibit a clear positive slope (“lift-off”), indicating effective inference in longer games. Claude haiku 3.5 is the first model in the performance ranking to show this lift-off, and from DeepSeek R1 0528 Qwen3 8B onward, many models track the round-by-round improvement profile of the lower $75\%$ human cohort. Overall, models that display lift-off are predominantly those with extended reasoning.
  • ...and 8 more figures
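
The loss decomposition described in the Figure 2B caption reduces to two subtractions, sketched below in Python. The function name `decompose_loss` and the example success rates are hypothetical and chosen only for illustration; they are not results from the paper. The sketch assumes all three success rates are measured on the same set of games.

```python
def decompose_loss(agent_success, counterfactual_map_success, reference_success):
    """Split an agent's gap to the reference agent into inference and sampling parts.

    agent_success: success rate of the agent's own final decisions.
    counterfactual_map_success: success rate when the agent's sampled evidence
        is kept but the final decision is made by MAP inference.
    reference_success: success rate of the near-optimal reference agent
        (PPO sampling + MAP inference in the caption's terms).
    """
    inference_loss = counterfactual_map_success - agent_success
    sampling_loss = reference_success - counterfactual_map_success
    return {
        "inference_loss": inference_loss,
        "sampling_loss": sampling_loss,
        # By construction the two losses sum to the total gap.
        "total_gap": reference_success - agent_success,
    }


if __name__ == "__main__":
    # Made-up success rates, for illustration only.
    result = decompose_loss(
        agent_success=0.50,
        counterfactual_map_success=0.62,
        reference_success=0.70,
    )
    for name, value in result.items():
        print(f"{name}: {value:.2f}")
```

Because the counterfactual agent shares the evidence of the original agent, any remaining gap to the reference can only come from which evidence was sampled, which is why the second term isolates sampling quality.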