Bayesian Statistical Modeling with Predictors from LLMs

Michael Franke, Polina Tsvilodub, Fausto Carcassi

TL;DR

This work probes whether LLM-derived predictions can be integrated into Bayesian models of human behavior, using a text-based reference-game paradigm to compare item-level LLM predictions with human data and RSA-based probabilistic pragmatics. It demonstrates that item-level LLM predictions inject variance not consistently observed in humans, while some aggregation schemes (notably average-WTA) can capture condition-level data for certain backends (e.g., GPT-3.5, some LLaMA2 variants). The study provides a methodological framework for criticising LLM-based predictors with posterior predictive checks and highlights that aggregation choices and backend differences critically shape fit and interpretation. The findings call for caution when using LLMs as explanatory models or substitutes for human subjects, while suggesting avenues for integrating LLMs into hybrid probabilistic models under principled aggregation and validation. Overall, LLMs can contribute distributional predictions at the aggregate level, but their item-level variance may not align with human data, underscoring the need for careful, model-specific evaluation in cognitive applications.

Abstract

State-of-the-art large language models (LLMs) have shown impressive performance on a variety of benchmark tasks and are increasingly used as components in larger applications, where LLM-based predictions serve as proxies for human judgements or decisions. This raises questions about the human-likeness of LLM-derived information, alignment with human intuition, and whether LLMs could possibly be considered (parts of) explanatory models of (aspects of) human cognition or language use. To shed more light on these issues, we here investigate the human-likeness of LLMs' predictions for multiple-choice decision tasks from the perspective of Bayesian statistical modeling. Using human data from a forced-choice experiment on pragmatic language use, we find that LLMs do not capture the variance in the human data at the item level. We suggest different ways of deriving full distributional predictions from LLMs for aggregate, condition-level data, and find that some, but not all, ways of obtaining condition-level predictions yield adequate fits to human data. These results suggest that assessment of LLM performance depends strongly on seemingly subtle choices in methodology, and that LLMs are at best predictors of human behavior at the aggregate, condition level, for which they are, however, not designed, or usually used, to make predictions in the first place.

Paper Structure

This paper contains 26 sections, 14 equations, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: Schematic representation of key conceptual differences between different types of predictive models. Standard statistical models, like hierarchical regression models, typically make predictions for aggregate data (e.g., at the condition level), and add random offsets for item-level variation. More sophisticated models, like some probabilistic cognitive models, may holistically combine information from an aggregate level (task, condition) with specific information about items. LLMs, in contrast, first and foremost give predictions about each individual item and must rely on proper aggregation to arrive at condition-level predictions.
  • Figure 2: Structure of a reference game with human participants. Each trial consists of a set of objects, the so-called context. In production trials, participants choose a single word to describe a trigger object from the context. In interpretation trials, an object is selected as the likely object a trigger word is referring to.
  • Figure 3: Example of predictions from the RSA model. The semantic meaning is shown as a matrix of binary truth-values. The policies of literal listener, pragmatic speaker and listener are calculated for uniform priors over states (referents) for $\alpha=1$, and are shown as row-stochastic matrices.
  • Figure 4: Counts of choices from reference games with human participants (colored bars), with summary statistics from the posterior predictive distribution of four models (shapes and error bars). Shapes show the mean of the posterior predictive distributions of the RSA model and three aggregated condition-level predictors derived from item-level LLM scores (introduced in the section on LLM predictions for reference games). Error bars show corresponding 95% credible intervals of the posterior predictive.
  • Figure 5: Predicted probability of the highest-scoring answer category, averaged over items, for different values of the softmax parameter $\alpha$ (black lines). The gray-shaded area indicates the posterior 95% credible interval for $\alpha$, and the implied probabilistic prediction. For reference, the target probabilities under a random choice strategy and under the "winner-takes-all" (WTA) strategy are shown with dashed lines. For credible values of $\alpha$, the means of predicted probabilities for the target option are clearly distinct from the WTA strategy.
  • ...and 5 more figures
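
The computation described in the Figure 3 caption (a binary semantic matrix turned into row-stochastic policies for the literal listener, pragmatic speaker, and pragmatic listener) can be illustrated with a minimal sketch. It assumes the common vanilla RSA formulation without utterance costs, which may differ in detail from the paper's model. The function names, the three-object context, and the word list are hypothetical and chosen only for illustration.

```python
import numpy as np


def normalize_rows(m):
    """Scale each row to sum to 1; rows that sum to 0 are left as zeros."""
    totals = m.sum(axis=1, keepdims=True)
    return np.divide(m, totals, out=np.zeros_like(m, dtype=float), where=totals > 0)


def rsa(semantics, alpha=1.0, prior=None):
    """Vanilla RSA without utterance costs (an assumed, simplified variant).

    semantics: array of shape (n_utterances, n_referents) with binary truth values.
    Returns (literal_listener, speaker, pragmatic_listener), each row-stochastic.
    """
    sem = np.asarray(semantics, dtype=float)
    n_ref = sem.shape[1]
    # Uniform prior over referents unless one is supplied.
    prior = np.full(n_ref, 1.0 / n_ref) if prior is None else np.asarray(prior, dtype=float)

    # Literal listener: P(referent | utterance) proportional to truth value * prior.
    literal_listener = normalize_rows(sem * prior)
    # Pragmatic speaker: P(utterance | referent) proportional to L0(referent | utterance)^alpha.
    speaker = normalize_rows(literal_listener.T ** alpha)
    # Pragmatic listener: P(referent | utterance) proportional to prior * speaker probability.
    pragmatic_listener = normalize_rows(speaker.T * prior)
    return literal_listener, speaker, pragmatic_listener


# Hypothetical context: columns are referents, rows are words.
referents = ["blue square", "blue circle", "green square"]
words = ["blue", "green", "square", "circle"]
semantics = [
    [1, 1, 0],  # "blue" is true of the two blue objects
    [0, 0, 1],  # "green"
    [1, 0, 1],  # "square"
    [0, 1, 0],  # "circle"
]

L0, S1, L1 = rsa(semantics, alpha=1.0)
print("literal listener:\n", np.round(L0, 3))
print("pragmatic speaker:\n", np.round(S1, 3))
print("pragmatic listener:\n", np.round(L1, 3))
```

In this hypothetical context, the literal listener splits "blue" evenly between the two blue objects, whereas the pragmatic listener favors the blue square (0.6 versus 0.4), because a speaker referring to the blue circle would more likely have said "circle".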