Revisiting the Reliability of Psychological Scales on Large Language Models

Jen-tse Huang; Wenxiang Jiao; Man Ho Lam; Eric John Li; Wenxuan Wang; Michael R. Lyu

TL;DR

This study examines whether human personality scales are reliable when applied to Large Language Models (LLMs) by systematically varying prompts across 2,500 settings per model. It introduces a five-factor framework covering instruction, item wording, language, choice label, and choice order, and uses the Big Five Inventory (BFI) to assess reliability and stability. GPT-3.5-Turbo, GPT-4-Turbo, and Gemini-Pro give consistent BFI responses, while LLaMA-3.1-8B shows more variability. Prompt-based strategies can also actively shape the personality an LLM presents, especially with the POR approach. The work offers a framework for assessing the reliability of psychometric scales on LLMs, highlights the potential to simulate diverse human populations for social science research, and discusses limitations related to validity, translation effects, and ethical implications.

Abstract

Recent research has focused on examining Large Language Models' (LLMs) characteristics from a psychological standpoint, acknowledging the necessity of understanding their behavioral characteristics. The administration of personality tests to LLMs has emerged as a noteworthy area in this context. However, the suitability of employing psychological scales, initially devised for humans, on LLMs is a matter of ongoing debate. Our study aims to determine the reliability of applying personality assessments to LLMs, explicitly investigating whether LLMs demonstrate consistent personality traits. Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory, indicating a satisfactory level of reliability. Furthermore, our research explores the potential of GPT-3.5 to emulate diverse personalities and represent various groups, a capability increasingly sought after in social sciences for substituting human participants with LLMs to reduce costs. Our findings reveal that LLMs have the potential to represent different personalities with specific prompt instructions.
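
As a rough illustration of how a five-factor prompt grid can reach 2,500 settings, the sketch below enumerates the Cartesian product of the options for each factor. The per-factor option counts (5, 5, 10, 5, 2) and all option names are assumptions chosen only so that the product equals 2,500; the abstract does not state them, and this is not the authors' code.

```python
from itertools import product

# Hypothetical options per factor. The counts (5 x 5 x 10 x 5 x 2) are an
# assumption picked so the grid has 2,500 cells; the names are placeholders.
FACTOR_OPTIONS = {
    "instruction": [f"instruction_{i}" for i in range(5)],
    "item": [f"item_version_{i}" for i in range(5)],
    "language": [f"language_{i}" for i in range(10)],
    "choice_label": [f"label_style_{i}" for i in range(5)],
    "choice_order": ["ascending", "descending"],
}


def enumerate_settings(factor_options):
    """Return one dict per combination of factor options."""
    names = list(factor_options)
    return [
        dict(zip(names, combo))
        for combo in product(*factor_options.values())
    ]


settings = enumerate_settings(FACTOR_OPTIONS)
print(len(settings))  # 2500
```

Each dictionary in `settings` describes one prompt configuration under which a scale such as the BFI could be administered, so consistency can then be compared across the whole grid.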

Paper Structure (33 sections, 8 figures, 15 tables)

Introduction
Preliminaries
Personality Tests
Reliability and Validity of Scales
The Reliability of Scales on LLMs
Framework Design
(1) Instruction
(2) Item
(3) Language
(4) Choice Label
(5) Choice Order
Experimental Results
Visualization
Quantitative Analysis
Test-Retest Reliability
...and 18 more sections

Figures (8)

Figure 1: Biweekly BFI measurements on GPT-3.5-Turbo from mid-September 2023 to late January 2024. The model went through two versions (0613, 1106) during this period. The shaded band represents the standard deviation ($\pm Std$).
Figure 2: Visualization (projecting BFI’s five dimensions to a 2-D space) of 2,500 GPT-3.5-Turbo data points. (a): the outliers and main body with the probability density (the darker the denser). (b) to (f): different options in each factor, marked in distinct colors and shapes. The gray area illustrates all possible values in BFI tests.
Figure 3: Visualization (projecting BFI’s five dimensions to a 2-D space) of all GPT-3.5-Turbo data points under different methods of manipulating personalities. Different situations are marked in distinct colors and shapes, while the original (default) personality distribution of GPT-3.5-Turbo is shown in gray triangles. (a) and (b): creating an environment. (c) and (d): assigning a personality. (e) and (f): embodying a character.
Figure 4: Visualization (projecting BFI’s five dimensions to a 2-D space) of GPT-3.5-Turbo data points when assigning personalities and embodying characters. Runs with and without CoT are distinguished in red and blue, while the original (default) personality distribution of GPT-3.5-Turbo is shown in gray triangles.
Figure 5: Visualization (projecting BFI’s five dimensions to a 2-D space) of all GPT-4-Turbo data points. (a): the outliers and main body with the probability density (the darker the denser). (b) to (f): different options in each factor, marked in distinct colors and shapes. The gray area illustrates all possible values in BFI tests.
...and 3 more figures
