Bayesian Statistical Modeling with Predictors from LLMs
Michael Franke, Polina Tsvilodub, Fausto Carcassi
TL;DR
This work probes whether LLM-derived predictions can be integrated into Bayesian models of human behavior, using a text-based reference-game paradigm to compare item-level LLM predictions with human data and RSA-based probabilistic pragmatics. It demonstrates that item-level LLM predictions inject variance not consistently observed in humans, while some aggregation schemes (notably average-WTA) can capture condition-level data for certain backends (e.g., GPT-3.5, some LLaMA2 variants). The study provides a methodological framework for criticising LLM-based predictors with posterior predictive checks and highlights that aggregation choices and backend differences critically shape fit and interpretation. The findings call for caution when using LLMs as explanatory models or substitutes for human subjects, while suggesting avenues for integrating LLMs into hybrid probabilistic models under principled aggregation and validation. Overall, LLMs can contribute distributional predictions at the aggregate level, but their item-level variance may not align with human data, underscoring the need for careful, model-specific evaluation in cognitive applications.
Abstract
State-of-the-art large language models (LLMs) have shown impressive performance on a variety of benchmark tasks and are increasingly used as components in larger applications, where LLM-based predictions serve as proxies for human judgements or decisions. This raises questions about the human-likeness of LLM-derived information, alignment with human intuition, and whether LLMs could possibly be considered (parts of) explanatory models of (aspects of) human cognition or language use. To shed more light on these issues, we here investigate the human-likeness of LLMs' predictions for multiple-choice decision tasks from the perspective of Bayesian statistical modeling. Using human data from a forced-choice experiment on pragmatic language use, we find that LLMs do not capture the variance in the human data at the item level. We suggest different ways of deriving full distributional predictions from LLMs for aggregate, condition-level data, and find that some, but not all, ways of obtaining condition-level predictions yield adequate fits to human data. These results suggest that the assessment of LLM performance depends strongly on seemingly subtle choices in methodology, and that LLMs are at best predictors of human behavior at the aggregate, condition level, for which they are, however, neither designed nor usually used to make predictions in the first place.
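To make the distinction between item-level and condition-level predictions concrete, the following is a minimal sketch of two ways item-level choice probabilities could be aggregated into a single condition-level distribution. It is an illustration under stated assumptions, not the authors' implementation: the function names, the example numbers, and the reading of "average-WTA" as "take the highest-scoring option per item, then average these choices across items" are all assumptions made here.

```python
import numpy as np

def average_aggregate(item_probs):
    """Average item-level choice probabilities across items.

    item_probs: array-like of shape (n_items, n_options), each row summing to 1.
    Returns one condition-level distribution over the options.
    """
    item_probs = np.asarray(item_probs, dtype=float)
    return item_probs.mean(axis=0)

def average_wta_aggregate(item_probs):
    """Assumed reading of 'average-WTA': winner-take-all per item, then average.

    Each item contributes a one-hot vector for its highest-probability option;
    the condition-level prediction is the mean of these one-hot vectors.
    """
    item_probs = np.asarray(item_probs, dtype=float)
    n_options = item_probs.shape[1]
    winners = item_probs.argmax(axis=1)      # index of the best option per item
    one_hot = np.eye(n_options)[winners]     # shape (n_items, n_options)
    return one_hot.mean(axis=0)

# Hypothetical item-level predictions for one condition: 4 items, 3 options.
item_probs = [
    [0.60, 0.30, 0.10],
    [0.50, 0.40, 0.10],
    [0.20, 0.70, 0.10],
    [0.40, 0.35, 0.25],
]

avg = average_aggregate(item_probs)      # smooth average of the item distributions
wta = average_wta_aggregate(item_probs)  # proportion of items won by each option
print("average:    ", avg)
print("average-WTA:", wta)
```

With these made-up numbers the two schemes disagree noticeably on the first two options, which illustrates why the choice of aggregation scheme can change how well a condition-level prediction fits human data.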
