Generating Computational Cognitive Models using Large Language Models

Milena Rmus, Akshay K. Jagadish, Marvin Mathony, Tobias Ludwig, Eric Schulz

TL;DR

This work introduces GeCCo, a guided-generation pipeline that uses open-source Large Language Models to generate executable cognitive models as Python functions, iteratively refining them with feedback from fits to held-out data. Across decision making, learning, planning, and working memory, GeCCo-produced models match or exceed domain-specific baselines in model fit (BIC) and posterior predictive checks, while capturing substantial explainable variance comparable to a foundation model. The approach relies on in-context learning, a hybrid LLM-plus-optimization loop, and control analyses to demonstrate the scalability, interpretability, and domain generalizability of LLM-driven cognitive-model discovery. The findings suggest LLMs can democratize cognitive-model generation, accelerate theory development, and inspire new theoretical insights by revealing compact, empirically competitive models across diverse task domains.

Abstract

Computational cognitive models, which formalize theories of cognition, enable researchers to quantify cognitive processes and arbitrate between competing theories by fitting models to behavioral data. Traditionally, these models are handcrafted, which requires significant domain knowledge, coding expertise, and time investment. However, recent advances in machine learning offer solutions to these challenges. In particular, Large Language Models (LLMs) have demonstrated remarkable capabilities for in-context pattern recognition, leveraging knowledge from diverse domains to solve complex problems, and generating executable code that can be used to facilitate the generation of cognitive models. Building on this potential, we introduce a pipeline for Guided generation of Computational Cognitive Models (GeCCo). Given task instructions, participant data, and a template function, GeCCo prompts an LLM to propose candidate models, fits proposals to held-out data, and iteratively refines them based on feedback constructed from their predictive performance. We benchmark this approach across four different cognitive domains -- decision making, learning, planning, and memory -- using three open-source LLMs, spanning different model sizes, capacities, and families. On four human behavioral data sets, the LLM generated models that consistently matched or outperformed the best domain-specific models from the cognitive science literature. Taken together, our results suggest that LLMs can generate cognitive models with conceptually plausible theories that rival -- or even surpass -- the best models from the literature across diverse task domains.
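The propose, fit, and refine loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `propose_models` is a hypothetical stand-in for the LLM call, and each "model" is reduced to a single Bernoulli choice probability so the example runs without an LLM. The defaults of 10 iterations and three candidates per iteration follow the Figure 1 caption. The fitting step is collapsed into scoring a fixed parameter, whereas a real pipeline would optimize each candidate's free parameters.

```python
import math
import random

def bic(neg_log_lik, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) + 2 * NLL (lower is better)."""
    return n_params * math.log(n_obs) + 2.0 * neg_log_lik

def neg_log_lik(p, choices):
    """Negative log-likelihood of binary choices under a fixed probability p."""
    return -sum(math.log(p if c == 1 else 1.0 - p) for c in choices)

def propose_models(feedback, n_candidates, rng):
    """Hypothetical stand-in for the LLM: propose candidates near the best so far.

    A real pipeline would return Python source code written by the LLM;
    here a candidate is just a choice probability clipped to [0.01, 0.99].
    """
    center = feedback["best_p"] if feedback else 0.5
    return [min(0.99, max(0.01, rng.gauss(center, 0.15)))
            for _ in range(n_candidates)]

def guided_search(held_out, n_iterations=10, n_candidates=3, seed=0):
    """Iteratively propose candidates, score them by BIC, and feed back the best."""
    rng = random.Random(seed)
    feedback = None
    best = {"bic": math.inf, "best_p": None}
    for _ in range(n_iterations):
        for p in propose_models(feedback, n_candidates, rng):
            score = bic(neg_log_lik(p, held_out), n_params=1, n_obs=len(held_out))
            if score < best["bic"]:
                best = {"bic": score, "best_p": p}
        feedback = best  # the next round of proposals is conditioned on this
    return best
```

Calling `guided_search([1] * 70 + [0] * 30)` returns the lowest-BIC candidate found across all iterations, mirroring the "best model across all sampling iterations" selection rule.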

Paper Structure

This paper contains 74 sections, 20 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Schematic of GeCCo: We prompt the LLM with a task description, participant data, guardrails that constrain the format of LLM responses, and a code template, so that it generates cognitive models, each offering a different explanation of the underlying data, as Python functions. Model generation evolves over 10 sampling iterations. During each iteration, three LLM-generated models are fitted offline to the held-out data (not included in the prompt), and their fit, measured by the Bayesian Information Criterion (BIC; watanabe2013widely), is fed back to the LLM on the subsequent iteration. The best model across all 10 sampling iterations is used for evaluation. The LLM-generated models are evaluated by 1) fitting them to behavioral data and comparing their fit to that of the baseline cognitive model (e.g., the best-performing model from the literature) using BIC, and 2) running posterior predictive checks, i.e., simulating the models and comparing simulated to ground-truth data, to further verify their validity. For full prompts, see the Appendix.
  • Figure 2: Experiment 1: Decision Making. A) Schematic of the decision task from hilbig2014generalized, where participants were asked to choose between two options based on four binary features and their validities. Arrow thickness indicates validity value (i.e., thicker arrows mean higher validity). B) Model fit comparison: LLM-generated models from R1 and Llama outperformed the best literature model. C) Posterior predictive checks showed that proportions of choices accounted for by the canonical heuristics (Equal Weighting, Take The Best and Weighted Additive Heuristics) closely aligned between human data and respective LLM and pWADD model predictions. D) Code of the best LLM-generated model (Llama), which can arbitrate between three canonical heuristics mentioned above via a discount factor.
  • Figure 3: Experiment 2: Learning. A) Schematic of the learning task from chambon2020information, where participants chose between two options and received feedback for both, but only got the reward for the chosen option. B) Model fit comparison: LLM-generated models from Llama on average fit better than $RW^{4\alpha}$. C) Posterior predictive checks showing close alignment between human data and predictions of the best LLM model in both low and high reward blocks. D) Code of the best LLM-generated model (Llama), which displays asymmetric learning rates along with forgetting of values, and a dedicated fictive trace for counterfactual outcomes.
  • Figure 4: Experiment 3: Planning. A) Schematic of the planning task from feher2020humans, where participants took two steps in a stochastic environment with common and rare transitions and fluctuating rewards. B) Model fit comparison: the model generated by R1 had a lower average BIC score than the Hybrid model. C) Posterior predictive checks showing the same pattern across humans, the literature model, and the R1-generated model: repetition of common (dark) and rare (bright) transitions depends on whether the previous action was rewarded. D) Code of the best LLM-generated model (R1), which uses separate learning rates for common and rare transitions and an inverse temperature for exploration, while omitting discounting of rewards.
  • Figure 5: Experiment 4: Working Memory. A) Schematic of the reinforcement learning - working memory task from rmus2023age, where participants learned a varying number of state-action associations. B) Model fit comparison: Llama-generated models outperformed the best literature model. C) Posterior predictive checks showing close alignment between human data and predictions of the best LLM model. D) Code of the best LLM-generated model (Llama), which distinguishes between fast learning under low cognitive load (working memory) and slower learning under high cognitive load.
  • ...and 9 more figures
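The captions above describe cognitive models written as Python functions and compared by BIC. As a hedged illustration of what such a function might look like, the sketch below computes the negative log-likelihood of a two-option Rescorla-Wagner model with separate learning rates for positive and negative prediction errors, followed by its BIC. It is not the model shown in any figure panel: the function name, the parameter layout, and the initial values of 0.5 are assumptions made for this example, and forgetting and counterfactual updates are omitted.

```python
import math

def asymmetric_rw_nll(params, choices, rewards):
    """Negative log-likelihood of a two-option Rescorla-Wagner model.

    params: (alpha_pos, alpha_neg, beta), i.e. learning rates for positive and
    negative prediction errors and a softmax inverse temperature.
    choices: chosen option per trial (0 or 1).
    rewards: reward obtained for the chosen option per trial.
    """
    alpha_pos, alpha_neg, beta = params
    q = [0.5, 0.5]  # initial value estimates (an assumption of this sketch)
    nll = 0.0
    for choice, reward in zip(choices, rewards):
        # Two-option softmax probability of the chosen option, in logistic form.
        diff = q[choice] - q[1 - choice]
        p_choice = 1.0 / (1.0 + math.exp(-beta * diff))
        nll -= math.log(p_choice)
        # Update the chosen option with a sign-dependent learning rate.
        delta = reward - q[choice]
        alpha = alpha_pos if delta >= 0 else alpha_neg
        q[choice] += alpha * delta
    return nll

def bic(nll, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) + 2 * NLL (lower is better)."""
    return n_params * math.log(n_obs) + 2.0 * nll

# Example: score one parameter setting on a short, made-up choice sequence.
choices = [0, 0, 1, 0]
rewards = [1, 1, 0, 1]
nll = asymmetric_rw_nll((0.3, 0.1, 5.0), choices, rewards)
score = bic(nll, n_params=3, n_obs=len(choices))
```

In practice the parameters would be estimated per participant by minimizing the negative log-likelihood with a numerical optimizer before computing BIC; the fixed values here only demonstrate the interface.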