Mind Reading or Misreading? LLMs on the Big Five Personality Test

Francesco Di Cursi, Chiara Boldrini, Marco Conti, Andrea Passarella

TL;DR

This study evaluates five LLMs, including GPT-4 and open-source variants, on binary Big Five personality prediction across three datasets under two prompting regimes. Enriched prompts reduce invalid outputs and can balance predictions for weaker models, but they introduce a systematic bias toward predicting trait presence. No model achieves consistently reliable zero-shot performance, and aggregate metrics can mask per-class weaknesses; per-class recall is a more informative diagnostic. The findings underscore the need for careful prompt design, trait framing, and transparent evaluation when applying LLMs to automatic personality prediction from text (APPT) in practical settings.

Abstract

We evaluate large language models (LLMs) for automatic personality prediction from text (APPT) under the binary Five Factor Model (BIG5). Five models, including GPT-4 and lightweight open-source alternatives, are tested across three heterogeneous datasets (Essays, MyPersonality, Pandora) and two prompting strategies (minimal vs. enriched with linguistic and psychological cues). Enriched prompts reduce invalid outputs and improve class balance, but also introduce a systematic bias toward predicting trait presence. Performance varies substantially: Openness and Agreeableness are relatively easier to detect, while Extraversion and Neuroticism remain challenging. Although open-source models sometimes approach GPT-4 and prior benchmarks, no configuration yields consistently reliable predictions in zero-shot binary settings. Moreover, aggregate metrics such as accuracy and macro-F1 mask significant asymmetries, with per-class recall offering clearer diagnostic value. These findings show that current out-of-the-box LLMs are not yet suitable for APPT, and that careful coordination of prompt design, trait framing, and evaluation metrics is essential for interpretable results.
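The point about aggregate metrics masking asymmetries can be illustrated with a minimal sketch. The labels below are invented toy data, not results from the paper, and `binary_report` is a hypothetical helper written only for this illustration. It mimics a classifier biased toward predicting trait presence (class 1) and compares accuracy and macro-F1 against per-class recall.

```python
def binary_report(y_true, y_pred):
    """Return accuracy, macro-F1, and per-class recall for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for cls in (0, 1):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[cls] = {"recall": recall, "f1": f1}
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    macro_f1 = (per_class[0]["f1"] + per_class[1]["f1"]) / 2
    return {
        "accuracy": accuracy,
        "macro_f1": macro_f1,
        "recall_0": per_class[0]["recall"],
        "recall_1": per_class[1]["recall"],
    }


# Toy data (assumption): 6 texts with the trait present, 4 with it absent.
y_true = [1] * 6 + [0] * 4
# A presence-biased predictor: it says "present" for all but one text.
y_pred = [1] * 6 + [1, 1, 1, 0]

report = binary_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

On this toy input, accuracy is 0.70 and macro-F1 is 0.60, which look moderate, while recall is 1.00 for class 1 and only 0.25 for class 0. The aggregate scores hide the fact that the predictor almost never recognizes trait absence, which is the kind of asymmetry per-class recall exposes.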

Paper Structure

This paper contains 26 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Number of experiments per prompt strategy and evaluation framework
  • Figure 2: Evaluation of LLMs according to class-wise F1.
  • Figure 3: Evaluation of LLMs according to precision and recall gaps.
  • Figure 4: Differences in the metrics of the full binary classification report (rows), per trait (columns) and model (bars). The x values indicate the gap between the results of the complex and simple prompts (i.e., Complex - Simple) w.r.t. each evaluation metric. Note that the plot for Pandora has been moved to the Appendix (Figure \ref{fig:pandora_delta}) due to space constraints.
  • Figure 5: All experiments with class-wise F1 $\geq$ 0.5 (sorted in descending order on Accuracy, F1_1 and F1_0)
  • ...and 5 more figures