You don't need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments

Bangzhao Shu; Lechen Zhang; Minje Choi; Lavinia Dunagan; Lajanugen Logeswaran; Moontae Lee; Dallas Card; David Jurgens

TL;DR

This paper interrogates the reliability of using psychometric-style prompts to elicit LLMs' perceived personas. It introduces Model-Personas, a broad benchmark with 39 instruments and 693 questions across 115 axes, and a rigorous framework of prompt variants to separately test content and format effects on 17 LLMs. The results reveal substantial sensitivity to prompt perturbations, limited cross-model consistency, and weak negation handling, challenging the validity of inferring stable latent personas from such prompts. The work highlights the need for robustness checks and suggests possible mitigation strategies, including careful prompt design and targeted fine-tuning, to avoid misinterpreting LLMs as human-like agents.

Abstract

The versatility of Large Language Models (LLMs) on natural language understanding tasks has made them popular for research in the social sciences. To understand the properties and innate personas of LLMs, researchers have prompted them with questions that ask about particular opinions. In this study, we take a cautionary step back and examine whether the current format of prompting LLMs elicits responses in a consistent and robust manner. We first construct a dataset that contains 693 questions encompassing 39 different instruments of persona measurement on 115 persona axes. We then design a set of prompts containing minor variations to examine LLMs' ability to generate answers, as well as a set of content-level variations, such as switching the order of response options or negating the statement, to examine their consistency. Our experiments on 17 different LLMs reveal that even simple perturbations significantly degrade a model's question-answering ability, and that most LLMs have low negation consistency. Our results suggest that the currently widespread practice of prompting is insufficient to accurately and reliably capture model perceptions, and we therefore discuss potential alternatives to address these issues.
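To make the two content-level checks named in the abstract concrete, here is a minimal Python sketch of how order consistency and negation consistency could be scored. It is an illustration under stated assumptions, not the paper's implementation: the two-option "agree"/"disagree" answer format, the function names, and the toy responses are all hypothetical, and the paper's exact metric definitions may differ.

```python
from typing import Sequence

# Assumption: every question has exactly two answer options.
OPPOSITE = {"agree": "disagree", "disagree": "agree"}


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of positions where two aligned answer lists match."""
    if len(a) != len(b):
        raise ValueError("answer lists must be aligned question by question")
    if not a:
        raise ValueError("answer lists must not be empty")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def order_consistency(original: Sequence[str], reordered: Sequence[str]) -> float:
    """Share of questions answered identically after the options are reordered."""
    return agreement_rate(original, reordered)


def negation_consistency(original: Sequence[str], negated: Sequence[str]) -> float:
    """Share of questions where the answer flips when the statement is negated."""
    flipped = [OPPOSITE[answer] for answer in negated]
    return agreement_rate(original, flipped)


# Hypothetical model answers to four questions under three prompt variants.
original = ["agree", "agree", "disagree", "agree"]
reordered = ["agree", "agree", "disagree", "disagree"]
negated = ["disagree", "agree", "agree", "agree"]

print(order_consistency(original, reordered))   # 0.75
print(negation_consistency(original, negated))  # 0.5
```

In this toy data the model keeps its answer on three of four questions when the options are reordered, but flips its answer on only two of four when the statement is negated.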

Paper Structure (31 sections, 8 figures, 6 tables)

Introduction
Model-Personas: A Comprehensive Benchmark for Measuring Personas
Design of Prompt Variants
Prompts for Spurious Variation
Prompts for Content-level Variation
Experimental Setup
Measuring Model Comprehensibility
Measuring Sensitivity and Consistency
Comparison with Psychometric Measurements of Consistency and Reliability
Model Details
Results
LLMs differ in Comprehensibility
LLMs can be Sensitive even to Spurious Prompt Variation
Staying Consistent is Challenging for LLMs
Most LLMs maintain Order Consistency and Option Consistency
...and 16 more sections

Figures (8)

Figure 1: A comparison of LLMs on different consistency metrics. The area shaded in gray indicates the consistency of answering with a random valid response. We discover that while most LLMs provide consistent results regarding order and option consistency, they struggle with both cases of negation consistency.
Figure 2: Negation Consistency Shift after adding specific personalities into the prompt. Adding personalities decreases the general negation consistency of LLMs, even if some axes' consistencies are increased as outliers.
Figure 3: A comparison of model size and consistency when changing the order of the answers.
Figure 4: A comparison of model size and direct negation consistency. We discover that models' direct negation consistency tends to increase with model size within each model family (except BLOOMZ-560M). However, models of similar sizes perform differently across model families.
Figure 5: A comparison of model size and paraphrastic negation consistency. We discover that models' paraphrastic negation consistency is also correlated with model size within each model family (except BLOOMZ-560M).
...and 3 more figures
