Self-Assessment Tests are Unreliable Measures of LLM Personality

Akshat Gupta, Xiaoyang Song, Gopala Anumanchipalli

TL;DR

The paper interrogates the validity of measuring LLM personality with human self-assessment tests. By conducting two lightweight experiments—prompt sensitivity and option-order symmetry—across ChatGPT and multiple Llama2 variants using IPIP-300 items, it demonstrates substantial and statistically significant variability in trait scores solely due to prompt form and answer ordering ($\alpha = 0.05$). The results challenge the reliability of self-assessment instruments for LLMs and suggest that previous claims of LLM personality may reflect test design rather than intrinsic traits. Consequently, the work advocates against using these instruments for personality quantification in LLMs and highlights the need for more robust, principled evaluation methods that account for prompt and interface sensitivities.

Abstract

As large language models (LLMs) evolve in their capabilities, various recent studies have tried to quantify their behavior using psychological tools created to study human behavior. One such example is the measurement of "personality" of LLMs using self-assessment personality tests developed to measure human personality. Yet almost none of these works verify the applicability of these tests to LLMs. In this paper, we analyze the reliability of LLM personality scores obtained from self-assessment personality tests using two simple experiments. We first introduce the property of prompt sensitivity, where three semantically equivalent prompts representing three intuitive ways of administering self-assessment tests to LLMs are used to measure the personality of the same LLM. We find that all three prompts lead to very different personality scores, a difference that is statistically significant for all traits in a large majority of scenarios. We then introduce the property of option-order symmetry for personality measurement of LLMs. Since most of the self-assessment tests exist in the form of multiple-choice questions (MCQs), we argue that the scores should be robust not just to the prompt template but also to the order in which the options are presented. This test unsurprisingly reveals that the self-assessment test scores are not robust to the order of the options. These simple tests, done on ChatGPT and three Llama2 models of different sizes, show that self-assessment personality tests created for humans are unreliable measures of personality in LLMs.
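The option-order symmetry property described in the abstract can be illustrated with a small sketch. It is not the authors' code: the five-point response scale, the lettered layout, the prompt wording, the two sample statements, and the `always_first_option` responder are all assumptions made for illustration. The sketch shows why a model that favors an option position rather than an option meaning produces different trait scores once the options are reversed.

```python
from statistics import mean

# Assumed five-point IPIP-style response scale, scored 1..5.
# The exact wording is an assumption, not quoted from the paper.
OPTIONS = [
    "Very Inaccurate",
    "Moderately Inaccurate",
    "Neither Accurate Nor Inaccurate",
    "Moderately Accurate",
    "Very Accurate",
]
LETTERS = "ABCDE"


def letter_to_score(reverse: bool = False) -> dict[str, int]:
    """Map each option letter to the score of the option shown at that position."""
    scores = list(range(1, len(OPTIONS) + 1))
    if reverse:
        scores.reverse()
    return dict(zip(LETTERS, scores))


def render_prompt(statement: str, reverse: bool = False) -> str:
    """Build a hypothetical MCQ prompt with options in original or reversed order."""
    lines = [f'Statement: "{statement}"', "Options:"]
    for letter, score in letter_to_score(reverse).items():
        lines.append(f"({letter}) {OPTIONS[score - 1]}")
    lines.append("Answer:")
    return "\n".join(lines)


def score_answers(letters: list[str], reverse: bool = False) -> float:
    """Convert chosen letters back to scale scores and average them."""
    mapping = letter_to_score(reverse)
    return mean(mapping[letter] for letter in letters)


def always_first_option(prompt: str) -> str:
    """Hypothetical position-biased responder standing in for an LLM call."""
    return "A"


# Illustrative statements only; a real run would use the test's own items.
statements = ["I make friends easily.", "I worry about things."]
for reverse in (False, True):
    answers = [always_first_option(render_prompt(s, reverse)) for s in statements]
    print("reversed" if reverse else "original", score_answers(answers, reverse))
```

A responder whose answers tracked the meaning of the options would pick different letters under the two orders and arrive at the same score. The position-biased responder does not, which is the kind of asymmetry the second experiment is designed to expose.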

Paper Structure (11 sections, 6 figures, 3 tables)

This paper contains 11 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Self-assessment personality test scores for Llamav2 and ChatGPT on the IPIP-300 dataset. The prompts appended with "(R)" contain the reverse option order or scale measurement prompts, as described in the paper's section on option-order sensitivity. For numbers with standard deviations, please refer to the paper's table of scores.
  • Figure 2: Pairwise distributional difference test results for ChatGPT on the IPIP-300 dataset. In the heatmap, the number in each cell denotes the p-value of the Mann-Whitney U test between the two score distributions obtained under the prompt templates specified on the x and y axes. Note that the naming of the prompt templates follows the paper's table of prompts; for instance, $P1_O$ represents Prompt 1 with the original order.
  • Figure 3: Summary statistics of hypothesis test results.
  • Figure 4: Pairwise distributional difference test results for Llamav2-7B on the IPIP-300 dataset.
  • Figure 5: Pairwise distributional difference test results for Llamav2-13B on the IPIP-300 dataset.
  • ...and 1 more figure
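The pairwise heatmaps in the figure captions report Mann-Whitney U p-values between score distributions obtained under different prompt templates. The following is a minimal pure-Python sketch of such a pairwise comparison, not the authors' implementation. It uses the normal approximation with tie and continuity corrections as a simplified stand-in for a library routine such as `scipy.stats.mannwhitneyu`. The score lists are made-up toy data, and the labels `P1_R` and `P2_O` are assumed by analogy with the `P1_O` naming in the caption.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_p(x, y):
    """Two-sided p-value from the normal approximation (tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2  # U statistic of the first sample
    mu = n1 * n2 / 2                          # mean of U under the null
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0  # all observations identical: no evidence of a difference
    z = max((abs(u1 - mu) - 0.5) / sqrt(var), 0.0)  # continuity correction
    return 2 * (1 - NormalDist().cdf(z))


def pairwise_p_values(scores_by_prompt):
    """p-value for every unordered pair of prompt templates."""
    return {
        (a, b): mann_whitney_p(scores_by_prompt[a], scores_by_prompt[b])
        for a, b in combinations(scores_by_prompt, 2)
    }


# Toy per-item scores for one trait under three hypothetical templates.
toy_scores = {
    "P1_O": [4, 5, 4, 5, 4, 5, 4, 4],
    "P1_R": [2, 1, 2, 2, 1, 2, 3, 2],
    "P2_O": [4, 4, 5, 4, 5, 4, 5, 4],
}
ALPHA = 0.05  # significance level stated in the TL;DR
for pair, p in pairwise_p_values(toy_scores).items():
    print(pair, round(p, 4), "significant" if p < ALPHA else "not significant")
```

With only a handful of observations per template, an exact test would be preferable to the normal approximation; the approximation is used here only to keep the sketch dependency-free.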