GPT-ology, Computational Models, Silicon Sampling: How should we think about LLMs in Cognitive Science?

Desmond C. Ong

TL;DR

This paper surveys how cognitive science uses large language models. It outlines three main paradigms (GPT-ology, LLMs as computational models, and silicon sampling) and argues for a bird's-eye framework to assess the epistemic status and reliability of claims made under each. It highlights core methodological challenges that threaten robust inference: model access, prompt sensitivity, data provenance, and reproducibility. The author emphasizes the need for standard conventions, open-source evaluation, and attention to generalizability, so that insights remain valid as LLM technology evolves. Overall, the paper advocates a cautious, standards-driven approach to integrating LLMs into cognitive science, one that prioritizes reliability, transparency, and long-term interpretability.

Abstract

Large Language Models have taken the cognitive science world by storm. It is perhaps timely now to take stock of the various research paradigms that have been used to make scientific inferences about "cognition" in these models or about human cognition. We review several emerging research paradigms (GPT-ology, LLMs-as-computational-models, and "silicon sampling") and review recent papers that have used LLMs under these paradigms. In doing so, we discuss their claims as well as challenges to scientific inference under these various paradigms. We highlight several outstanding issues about LLMs that have to be addressed to push our science forward: closed-source vs. open-source models; (the lack of visibility of) training data; and reproducibility in LLM research, including forming conventions on new task "hyperparameters" like instructions and prompts.

Paper Structure

This paper contains 18 sections and 1 figure.

Figures (1)

  • Figure 1: The same experiment, assessing LLM performance on a given task (in this cartoon, presenting an LLM with a choice, where the LLM outputs a star), leads to different inferences depending on the initial research question. Researchers may make inferences about the capabilities of specific LLMs ("GPT-ology"), such as: "GPT can star-ify". Alternatively, we could use LLMs as a computational model of human learning; one example inference is that "statistical learning alone is sufficient for star-ifying". Finally, we could treat samples from an LLM under some conditioning context as illustrative of how people might respond under those conditions ("under <X> conditions, people may star-ify"). We note that these paradigms are neither exhaustive (more creative ones could appear) nor mutually exclusive; the same paper or research program could make several kinds of claims.