Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs

Julian Ma, Jun Wang, Zafeirios Fountas

TL;DR

The paper investigates whether large language models exhibit Bayesian perceptual strategies by introducing BayesBench, a psychophysics-inspired benchmark with four magnitude estimation tasks across text and image modalities. It presents a formal modelling framework (linear, static Bayesian, and Kalman-filter observers), plus cue-combination models and a Bayesian Consistency Score to assess robustness under controlled ablations. The results show that capable LLMs often adopt Bayes-consistent behaviour and benefit from multimodal cues, but high accuracy alone does not guarantee robust, Bayes-like integration, as illustrated by GPT-5 Mini’s strong text accuracy yet weak cue integration. BayesBench serves as a diagnostic tool that exposes how models handle uncertainty and combine cues, complementing traditional accuracy-based evaluation and informing the design of future multimodal architectures.

Abstract

Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark, BayesBench, comprising four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour, and efficiency in multimodal cue combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.
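
To make "optimal multimodal integration" concrete, the sketch below shows the textbook inverse-variance rule for fusing two independent Gaussian cues. It is an illustration only: this section does not give the paper's own cue-combination models, and the function name `combine_cues` and its parameters are hypothetical.

```python
import math


def combine_cues(x_text, sigma_text, x_image, sigma_image):
    """Fuse two independent Gaussian cues by inverse-variance weighting.

    Textbook rule, assumed here for illustration; not taken from the paper.
    Returns the fused estimate and its standard deviation.
    """
    # Reliability of each cue is the inverse of its noise variance.
    w_text = 1.0 / sigma_text ** 2
    w_image = 1.0 / sigma_image ** 2
    # The fused estimate is the reliability-weighted average of the cues.
    estimate = (w_text * x_text + w_image * x_image) / (w_text + w_image)
    # The fused variance is never larger than that of the better single cue.
    sigma = math.sqrt(1.0 / (w_text + w_image))
    return estimate, sigma


if __name__ == "__main__":
    # A reliable text cue (sd 1.0) and a noisier image cue (sd 2.0).
    print(combine_cues(10.0, 1.0, 12.0, 2.0))
```

Under this rule, an observer that integrates efficiently should end up with lower response variability from both cues together than from either cue alone; a model that ignores one cue would show no such gain.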

Paper Structure

This paper contains 61 sections, 9 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Comparison of LLM and human behaviour. A: Llama-4 Maverick in one of the line length ratio estimation experiments. The fitted lines are based on a static Bayesian observer model. Light dots are individual data points. B: Responses from typical human psychophysics studies (adapted from thurley2016magnitude). Both show a regression-to-the-mean effect, where responses are biased towards the centre of the stimulus range.
  • Figure 2: Example of the four magnitude estimation tasks. Cues in a blue background represent information provided as text, while orange represents vision.
  • Figure 3: GPT-5 Mini's mean response (to verbal cues) compared to the prediction of a sequential Bayes model (dotted line).
  • Figure 4: GPT-5 Mini's mean response trajectory (verbal cue). Arrows denote the sequence of its responses.
  • Figure 5: Example distribution of stimulus input for the marker location task.
  • ...and 12 more figures
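
The regression-to-the-mean effect described in the Figure 1 caption follows directly from a static Bayesian observer with a Gaussian prior. The sketch below is a minimal illustration under that assumption; the paper's exact parametrisation is not given in this section, and the function name and parameter values are hypothetical.

```python
def static_bayes_estimate(stimulus, prior_mean, prior_sd, noise_sd):
    """Posterior mean for a Gaussian prior and Gaussian measurement noise.

    Illustrative static Bayesian observer; names and values are assumptions.
    """
    # Weight on the stimulus shrinks as measurement noise grows.
    w = prior_sd ** 2 / (prior_sd ** 2 + noise_sd ** 2)
    # The estimate is pulled from the stimulus towards the prior mean.
    return w * stimulus + (1.0 - w) * prior_mean


if __name__ == "__main__":
    # Prior centred on the middle of a 0-1 stimulus range.
    for s in (0.1, 0.5, 0.9):
        print(s, static_bayes_estimate(s, prior_mean=0.5, prior_sd=0.2, noise_sd=0.2))
```

Small stimuli are overestimated and large stimuli are underestimated, so the response curve has a slope below one. More noise relative to the prior width flattens it further.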