Cognitive phantoms in LLMs through the lens of latent variables

Sanne Peereboom; Inga Schwabe; Bennett Kleinberg

TL;DR

This paper investigates whether psychometric instruments designed for humans can validly measure latent traits in large language models (LLMs), addressing the risk of cognitive phantoms. Using two validated questionnaires (HEXACO-60 and the Dark Side of Humanity Scale) administered to humans and three GPT-based LLMs, the authors compare latent structures through confirmatory and exploratory factor analyses, complemented by composite-score analyses. They find that human latent structures are replicable, but LLM responses yield arbitrary, non-reproducible factor structures and often fail factorability, casting doubt on the validity of applying these instruments to LLMs. The study argues for a latent-variable framework as essential for robust LLM evaluation and cautions against over-interpreting composite scores, highlighting implications for AI safety and future psychometric validation of LLM behavior.

Abstract

Large language models (LLMs) increasingly reach real-world applications, necessitating a better understanding of their behaviour. Their size and complexity complicate traditional assessment methods, causing the emergence of alternative approaches inspired by the field of psychology. Recent studies administering psychometric questionnaires to LLMs report human-like traits in LLMs, potentially influencing LLM behaviour. However, this approach suffers from a validity problem: it presupposes that these traits exist in LLMs and that they are measurable with tools designed for humans. Typical procedures rarely acknowledge the validity problem in LLMs, comparing and interpreting average LLM scores. This study investigates this problem by comparing latent structures of personality between humans and three LLMs using two validated personality questionnaires. Findings suggest that questionnaires designed for humans do not validly measure similar constructs in LLMs, and that these constructs may not exist in LLMs at all, highlighting the need for psychometric analyses of LLM responses to avoid chasing cognitive phantoms.

Keywords: large language models, psychometrics, machine behaviour, latent variable modeling, validity
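
As a rough illustration of the "average score" procedure the abstract criticises, the following is a minimal sketch of composite scoring with reverse-coded items. The item names, the example responses, and the 1-to-5 response scale are assumptions made for illustration; this is not the study's data or code.

```python
from statistics import mean

# Assumed 5-point Likert scale (illustrative, not taken from the source).
SCALE_MIN, SCALE_MAX = 1, 5


def reverse_code(response: int) -> int:
    """Flip a response on the assumed scale (1 <-> 5, 2 <-> 4, 3 stays 3)."""
    return SCALE_MIN + SCALE_MAX - response


def composite_score(responses: dict[str, int]) -> float:
    """Average item responses, reverse-coding items whose name ends in '_R'."""
    scored = [
        reverse_code(value) if item.endswith("_R") else value
        for item, value in responses.items()
    ]
    return mean(scored)


# Hypothetical responses to four items of a single dimension.
example = {"H1": 4, "H2_R": 2, "H3": 5, "H4_R": 1}
print(composite_score(example))  # mean of [4, 4, 5, 5]
```

A composite like this can be computed for any set of responses, whether or not the items share a common latent structure, which is why an average on its own says little about whether the underlying construct is being measured.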

Paper Structure (26 sections, 2 figures, 3 tables)

This paper contains 26 sections, 2 figures, 3 tables.

Introduction
Machine behaviour and machine psychology
Latent variables and psychometrics for LLMs
The validity problem for LLMs
The present study
Method
Human data
LLM data
Materials
Analysis plan
Latent variable approach
Composite score analysis
Results
Latent variable approach
Confirmatory factor analysis
...and 11 more sections

Figures (2)

Figure 1: (a) HEXACO-60 theoretical factor structure and HEXACO-60 item-factor correlations for EFAs in the (b) human sample; (c) GPT-3.5-T sample; and (d) GPT-4 sample. Nodes in the outer circle with the same colour and first letter theoretically belong to the same dimension ("_R" suffix indicates reverse-coded questions). The grey lines represent positive item-factor correlations ($\ge 0.4$), red dashed lines are negative item-factor correlations ($\le -0.4$). Items not connected to a line are not significantly related to any factor.
Figure A1: Snippet of the pseudo-code formatted input prompt used to administer the H60 and DSHS to the GPT models.
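
The edge rule described in the Figure 1 caption (a positive edge for item-factor correlations of at least 0.4, a negative edge for correlations of at most -0.4, and no edge otherwise) can be sketched as follows. The function name, the item labels, and the correlation values are hypothetical and chosen only to show the rule; they are not results from the paper.

```python
THRESHOLD = 0.4  # cut-off stated in the Figure 1 caption


def classify_edges(loadings: dict[str, list[float]]) -> dict[str, list[tuple[int, str]]]:
    """Map each item to its (factor_index, sign) edges under the 0.4 rule."""
    edges = {}
    for item, row in loadings.items():
        edges[item] = [
            (factor, "positive" if value >= THRESHOLD else "negative")
            for factor, value in enumerate(row)
            if abs(value) >= THRESHOLD
        ]
    return edges


# Hypothetical item-factor correlations: three items, two factors.
loadings = {
    "H1": [0.62, 0.05],
    "E2_R": [-0.45, 0.41],
    "X3": [0.10, -0.20],
}
edges = classify_edges(loadings)

# Items with no edge correspond to nodes drawn without a connecting line.
unconnected = [item for item, item_edges in edges.items() if not item_edges]
print(edges)
print(unconnected)
```

In this toy input, the second item connects to two factors at once and the third connects to none, the kind of pattern that would depart from a clean theoretical structure in which each item relates to exactly one dimension.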
