Can LLMs Assess Personality? Validating Conversational AI for Trait Profiling

Andrius Matšenas, Anet Lello, Tõnis Lees, Hans Peep, Kim Lilii Tamm

TL;DR

The paper addresses the limitations of static self-report personality assessment by validating real-time, guided LLM conversations as a dynamic alternative for Big Five profiling. Using a within-subjects design (N=33), it compares LLM-derived trait scores against the IPIP-50 gold standard and measures user-perceived accuracy. Results show moderate convergent validity ($r \in [0.38, 0.58]$): three traits (Conscientiousness, Openness, Neuroticism) are statistically equivalent across methods, while Agreeableness and Extraversion differ significantly between methods. Participants rate both methods as equally accurate. The work contributes a validation framework for conversational psychometrics and highlights practical potential for consumer applications, though certain traits need calibration and generalizability is limited.

Abstract

This study validates Large Language Models (LLMs) as a dynamic alternative to questionnaire-based personality assessment. Using a within-subjects experiment (N=33), we compared Big Five personality scores derived from guided LLM conversations against the gold-standard IPIP-50 questionnaire, while also measuring user-perceived accuracy. Results indicate moderate convergent validity (r=0.38-0.58), with Conscientiousness, Openness, and Neuroticism scores statistically equivalent between methods. Agreeableness and Extraversion showed significant differences, suggesting trait-specific calibration is needed. Notably, participants rated LLM-generated profiles as equally accurate as traditional questionnaire results. These findings suggest conversational AI offers a promising new approach to traditional psychometrics.
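As a rough illustration of the two statistics named above, the sketch below computes a Pearson correlation (convergent validity) and a confidence-interval-based equivalence check on paired scores. It is not the authors' analysis: the exact equivalence procedure is not stated in this summary, so the TOST-style check, the function names, the equivalence margin, and the toy scores are all assumptions made for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def paired_equivalence(x, y, margin, t_crit):
    """TOST-style check on paired data.

    Builds the 90% confidence interval of the mean paired difference
    (t_crit is the one-sided 95% t quantile for n - 1 degrees of freedom)
    and reports whether it lies entirely inside (-margin, +margin).
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    half_width = t_crit * sd_d / math.sqrt(n)
    lower, upper = mean_d - half_width, mean_d + half_width
    return (lower, upper), (-margin < lower and upper < margin)

# Hypothetical trait scores for 8 participants (NOT data from the paper).
questionnaire = [3.2, 4.1, 2.8, 3.9, 4.5, 3.0, 3.6, 2.5]
llm_derived = [3.0, 4.3, 3.1, 3.7, 4.2, 3.3, 3.5, 2.9]

r = pearson_r(questionnaire, llm_derived)
# 1.895 is the one-sided 95% t quantile for df = 7; the 0.5 margin is an assumed choice.
ci, equivalent = paired_equivalence(questionnaire, llm_derived, margin=0.5, t_crit=1.895)
print(f"r = {r:.2f}, 90% CI of mean difference = ({ci[0]:.2f}, {ci[1]:.2f}), equivalent = {equivalent}")
```

The two checks answer different questions: a correlation shows whether the methods rank participants similarly, while the equivalence check asks whether their mean scores agree within a chosen margin. This is why a trait can correlate moderately across methods and still differ significantly in level.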

Paper Structure

This paper contains 27 sections, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Experiment design
  • Figure 2: Example spider graph showing Big Five personality trait scores from both methods, displayed to participants before final feedback.
  • Figure 3: Distribution of Big Five trait scores by assessment method.
  • Figure 4: Hierarchical clustering of assessment methods based on result scores (left) and self-reported accuracy ratings (right).
  • Figure 5: Distribution of participant accuracy ratings by trait and method.
  • ...and 2 more figures