Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis

Nikolay B Petrov, Gregory Serapio-García, Jason Rentfrow

TL;DR

The paper examines whether contemporary LLMs can authentically simulate human personality traits in psychometric tasks. It applies a psychometric framework to GPT-3.5 and GPT-4 under two prompting regimes—generic and silicon personas—across a battery including the Big Five Inventory and related scales, benchmarking against a large BBC-based human ground-truth dataset. Findings show GPT-4 can resemble human norms under generic prompting but fails to recover stable latent trait structure and performs poorly under silicon prompting; GPT-3.5 generally performs worse. The authors caution against using LLMs as proxies for individual-level human behavior and emphasize the need for multi-method validation and targeted instruction tuning to improve reliability.

Abstract

The humanlike responses of large language models (LLMs) have prompted social scientists to investigate whether LLMs can be used to simulate human participants in experiments, opinion polls and surveys. Of central interest in this line of research has been mapping out the psychological profiles of LLMs by prompting them to respond to standardized questionnaires. The conflicting findings of this research are unsurprising given that mapping out underlying, or latent, traits from LLMs' text responses to questionnaires is no easy task. To address this, we use psychometrics, the science of psychological measurement. In this study, we prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs. We used two kinds of persona descriptions: either generic (four or five random person descriptions) or specific (mostly demographics of actual humans from a large-scale human dataset). We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties, similar to human norms, but the data from both LLMs, when using specific demographic profiles, show poor psychometric properties. We conclude that, currently, when LLMs are asked to simulate silicon personas, their responses are poor signals of potentially underlying latent traits. Thus, our work casts doubt on LLMs' ability to simulate individual-level human behaviour across multiple-choice question answering tasks.

Paper Structure (7 sections, 12 figures, 5 tables)


Figures (12)

  • Figure A1: Overview of the data collection and analysis processes. Each LLM (GPT-3.5 and GPT-4) is prompted using a template that combines a persona description with a survey item. The persona description is either generic, constructed from 4-5 random sentences from the PersonaChat dataset (Zhang et al., 2018), or silicon, constructed mostly from demographic information about humans in a large-scale personality survey (Rentfrow et al., 2015). The survey items that the LLM is asked to evaluate span various personality-related constructs. The LLM's text response is processed to extract a numeric response to the survey item, which is then analysed further.
  • Figure A2: Item-level response distributions across all responses, split by whether the first token was a digit. The LLMs (rows) produce text responses, which we further processed to extract a numeric response. Past research has used only responses whose first token is a digit and discarded the rest. Here, we compare the relative frequency (y-axis) of item-level responses across all survey items (x-axis), split by whether the first token was a digit (columns). The response distributions differ markedly, across all LLMs and prompting variations, depending on whether the first token is a digit.
  • Figure A3: Internal consistency of all measures across LLM models and prompting styles. We computed three reliability indices (colour) for every questionnaire (columns) and its subscales (x-axis) across all LLMs and prompting variations (rows). The plot shows that (a) differences between reliability indices are few and (b) the reliability of the data from LLMs, when using silicon sampling, can be very low (< .70) for some measures.
  • Figure A4: Intercorrelations between Big Five traits across LLM models and prompting styles. Intercorrelations between Big Five traits were computed for the tested LLMs across prompting styles (bottom two rows); human data from a large representative sample (Rentfrow et al., 2015) are shown in the top row. The plots show that data from LLMs tend to produce much higher intercorrelations than the human data.
  • Figure A5: Criterion validity correlations across LLMs and prompting styles. Selected Pearson’s correlations were computed between Big Five traits (columns) and personality-related constructs (x-axis) to test the criterion validity of the tested LLMs (bottom 4 rows). Comparison data from Serapio-García et al. (2023) are shown in the top row. NB: Serapio-García et al. (2023) used the IPIP-NEO to measure Big Five traits, while we used the BFI, though the authors also show that the two subscales are correlated at > .90.
  • ...and 7 more figures
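
The processing steps described in Figures A1 to A3 (extracting a numeric rating from free text, recording whether the response began with a digit, and checking internal consistency) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the 1-5 response scale, the rule of taking the first integer in the text, and the use of the first character as a stand-in for the first token are all assumptions made for illustration. The captions do not name the three reliability indices, so Cronbach's alpha is used here only as one commonly reported index.

```python
import re
from statistics import variance


def extract_rating(text, low=1, high=5):
    """Return (rating, starts_with_digit) for one LLM text response.

    Assumptions (not from the paper): the rating is the first integer in
    the text, valid ratings lie in [low, high], and "first token is a
    digit" is approximated by checking the first non-space character.
    Responses that restate the scale before answering would be misread
    by this simple rule.
    """
    stripped = text.strip()
    starts_with_digit = re.match(r"[0-9]", stripped) is not None
    match = re.search(r"[0-9]+", stripped)
    if match is None:
        return None, starts_with_digit
    value = int(match.group())
    if not low <= value <= high:
        return None, starts_with_digit
    return value, starts_with_digit


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents' item scores.

    rows: one list per respondent, all of equal length k >= 2.
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Hypothetical responses, invented for illustration only.
    for response in ["4", "I would say 3.", "As an AI, I cannot answer."]:
        print(repr(response), "->", extract_rating(response))
    print(cronbach_alpha([[1, 2], [2, 3], [3, 4]]))
```

Keeping the `starts_with_digit` flag alongside the rating makes it possible to compare the two groups of responses, as Figure A2 does, rather than discarding responses that do not begin with a digit.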