CogBench: a large language model walks into a psychology lab

Julian Coda-Forno; Marcel Binz; Jane X. Wang; Eric Schulz

TL;DR

CogBench is a behavioral benchmark for LLMs built on seven cognitive psychology tasks, shifting evaluation beyond performance alone. The study derives ten behavioral metrics, applies them to 35 models, and uses multilevel modeling to estimate how model size, reinforcement learning from human feedback (RLHF), and prompt engineering shape cognitive-like behaviors such as model-based reasoning and meta-cognition. Key findings: RLHF increases human-likeness and meta-cognition; larger models show higher performance and model-basedness; and the two prompting techniques studied, chain-of-thought (CoT) and take-a-step-back (SB), offer selective benefits. The work argues that behavior-centric evaluation should complement traditional benchmarks, discusses limitations concerning transparency and generalizability, and outlines directions for broader task coverage and automation.

Abstract

Large language models (LLMs) have significantly advanced the field of artificial intelligence. Yet, evaluating them comprehensively remains challenging. We argue that this is partly due to the predominant focus on performance metrics in most benchmarks. This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. This novel approach offers a toolkit for phenotyping LLMs' behavior. We apply CogBench to 35 LLMs, yielding a rich and diverse dataset. We analyze this data using statistical multilevel modeling techniques, accounting for the nested dependencies among fine-tuned versions of specific LLMs. Our study highlights the crucial role of model size and reinforcement learning from human feedback (RLHF) in improving performance and aligning with human behavior. Interestingly, we find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior. Finally, we explore the effects of prompt-engineering techniques. We discover that chain-of-thought prompting improves probabilistic reasoning, while take-a-step-back prompting fosters model-based behaviors.

Paper Structure (50 sections, 4 equations, 7 figures)

Introduction
Related work
Methods
Prompting and summary of included models
High-level summary of tasks
The cognitive phenotype of LLMs
Performance summary
Differences between behavioral and performance metrics
Hypothesis-driven experiments
Impact of prompt-engineering
Discussion
List of LLMs used
Comprehensive list & explanation of the cognitive experiments
Probabilistic reasoning (Dasgupta et al., 2020) - Prior & likelihood weighting
Summary
...and 35 more sections

Figures (7)

Figure 1: Overview of approach and methods. CogBench provides open access to seven different cognitive psychology experiments. These experiments are text-based and can be run to evaluate any LLM's behavior. The experiments are submitted to LLMs as textual prompts and the models indicate their choices by completing a given prompt. Past behavior is then concatenated to the prompt and learning is induced via prompt-chaining. We used 35 LLMs in total, including most larger proprietary LLMs as well as many open-source models.
Figure 2: CogBench results for established LLMs. A: Performance metrics, B: Behavioral metrics. All metrics are human-normalized: a value of zero corresponds to a random agent, while a value of one corresponds to the average human subject (dotted lines).
Figure 3: A: UMAP visualization of the ten behavioral metrics for all LLMs. Each point represents an LLM, with models using RLHF and models without RLHF indicated by different colors. B: Difference in average $L_2$ distance to human behavior between RLHF and non-RLHF models.
Figure 4: Multilevel regressions of LLM features onto different performance or behavioral metrics. Red bars represent effects included in a hypothesis. A: Regression onto all task performances. B: Regression onto model-basedness. C: Regression onto meta-cognition. D: Regression onto risk taking. ***: $p<0.001$, **: $0.001 \leq p<0.01$, *: $0.01 \leq p<0.05$.
Figure 5: Difference between chain-of-thought or take-a-step-back prompting and the baseline models on A: Posterior accuracy, B: Model-basedness. The aggregated scores are a weighted average over all five models, computed with inverse-variance weighting.
...and 2 more figures
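
Two computations mentioned in the captions above can be sketched in a few lines: the human normalization of Figure 2 (zero for a random agent, one for the average human) and the inverse-variance weighted average of Figure 5. This is only an illustrative sketch. The linear rescaling is an assumption that matches the two anchor points stated in the caption, the function names `human_normalize` and `inverse_variance_mean` are hypothetical, and the numbers in the usage lines are made up rather than taken from the paper.

```python
from math import sqrt


def human_normalize(raw, random_score, human_score):
    """Linearly rescale a metric so a random agent maps to 0 and the
    average human maps to 1 (assumed form; only the anchors are given)."""
    if human_score == random_score:
        raise ValueError("human and random reference scores must differ")
    return (raw - random_score) / (human_score - random_score)


def inverse_variance_mean(estimates, variances):
    """Combine per-model estimates, weighting each by 1 / variance.

    Returns the weighted mean and its standard error."""
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one positive variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, estimates)) / total
    return mean, sqrt(1.0 / total)


# Hypothetical values, for illustration only.
score = human_normalize(raw=0.75, random_score=0.5, human_score=1.0)
pooled, stderr = inverse_variance_mean([0.2, 0.4], [0.01, 0.04])
```

With inverse-variance weighting, a model whose effect is estimated more precisely (smaller variance) contributes more to the aggregate, so a single noisy model cannot dominate the pooled score.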
