Human Psychometric Questionnaires Mischaracterize LLM Psychology: Evidence from Generation Behavior

Woojung Song, Dongmin Choi, Yoonah Park, Jongwook Han, Eun-Ju Lee, Yohan Jo

Abstract

Psychological profiling of large language models (LLMs) using psychometric questionnaires designed for humans has become widespread. However, it remains unclear whether the resulting profiles mirror the psychological characteristics the models express during real-world interactions with users. To examine the risk that human questionnaires mischaracterize LLM psychology, we compare two types of profiles for eight open-source LLMs: self-reported Likert scores from established questionnaires (PVQ-40, PVQ-21, BFI-44, BFI-10) and generation probability scores of value- or personality-laden responses to real-world user queries. The two profiles differ substantially, providing evidence that LLMs' responses to established questionnaires reflect desired behavior rather than stable psychological constructs. This challenges prior claims that LLMs have consistent psychological dispositions. Established questionnaires also risk exaggerating the demographic biases of LLMs. Our results suggest caution when interpreting psychological profiles derived from established questionnaires and point to generation-based profiling as a more reliable approach to LLM psychometrics.

Paper Structure

This paper contains 87 sections, 10 equations, 18 figures, 39 tables.

Figures (18)

  • Figure 1: Cosine similarity heatmaps for PVQ-40 and VP value items. (a, b) Item-definition similarity. (c, d) Within-construct item similarity. Established items (a, c) show diagonal structure; VP items (b, d) do not.
  • Figure 2: PVQ-40 Likert prompt template, Variant 1 (high-to-low options).
  • Figure 3: PVQ-40 Likert prompt template, Variant 2 (low-to-high options).
  • Figure 4: PVQ-21 Likert prompt template, Variant 1 (high-to-low options).
  • Figure 5: PVQ-21 Likert prompt template, Variant 2 (low-to-high options).
  • ...and 13 more figures