Large Language Models Can Infer Personality from Free-Form User Interactions

Heinrich Peters, Moran Cerf, Sandra C. Matz

TL;DR

The paper investigates whether large language models can infer Big Five personality traits from free-form conversations, and how prompt design and interaction mode affect accuracy and user experience. Using a 3x2 between-subjects design with GPT-4 (assessment, acquaintance, and assistant prompts crossed with two user modes, conversation and unconstrained), the study correlates inferred trait scores with self-reported BFI-2 scores and collects user experience ratings. Inferences are most accurate when the chatbot is prompted to assess personality, and meaningful signals are also present in more naturalistic interactions. User experience is generally positive, although ratings are lower in the assistant condition. The work demonstrates the potential for scalable, conversational psychological profiling, and it highlights ethical and privacy challenges that call for careful governance as these capabilities spread.

Abstract

This study investigates the capacity of Large Language Models (LLMs) to infer the Big Five personality traits from free-form user interactions. The results demonstrate that a chatbot powered by GPT-4 can infer personality with moderate accuracy, outperforming previous approaches drawing inferences from static text content. The accuracy of inferences varied across different conversational settings. Performance was highest when the chatbot was prompted to elicit personality-relevant information from users (mean r=.443, range=[.245, .640]), followed by a condition placing greater emphasis on naturalistic interaction (mean r=.218, range=[.066, .373]). Notably, the direct focus on personality assessment did not result in a less positive user experience, with participants reporting the interactions to be equally natural, pleasant, engaging, and humanlike across both conditions. A chatbot mimicking ChatGPT's default behavior of acting as a helpful assistant led to markedly inferior personality inferences and lower user experience ratings but still captured psychologically meaningful information for some of the personality traits (mean r=.117, range=[-.004, .209]). Preliminary analyses suggest that the accuracy of personality inferences varies only marginally across different socio-demographic subgroups. Our results highlight the potential of LLMs for psychological profiling based on conversational interactions. We discuss practical implications and ethical challenges associated with these findings.
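
The accuracy figures in the abstract summarize per-trait correlations between inferred and self-reported scores by their mean and range. The sketch below illustrates that kind of summary. It assumes Pearson correlations and a plain arithmetic mean over traits (the abstract does not say how the mean was aggregated). The helper names `pearson_r` and `summarize_accuracy` and all scores are made up for illustration and are not taken from the paper.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def summarize_accuracy(inferred, self_reported):
    """Return per-trait r, the mean r, and the (min, max) range.

    Both arguments map a trait name to a list of scores, one per participant.
    The arithmetic mean over traits is an assumption of this sketch.
    """
    per_trait = {
        trait: pearson_r(inferred[trait], self_reported[trait])
        for trait in inferred
    }
    values = list(per_trait.values())
    return per_trait, sum(values) / len(values), (min(values), max(values))


# Hypothetical scores for two traits and four participants.
inferred = {
    "extraversion": [3.0, 4.0, 2.0, 5.0],
    "openness": [4.0, 3.0, 3.5, 2.0],
}
self_reported = {
    "extraversion": [2.5, 4.5, 2.0, 4.0],
    "openness": [3.0, 3.5, 4.0, 2.5],
}

per_trait, mean_r, r_range = summarize_accuracy(inferred, self_reported)
print(per_trait, mean_r, r_range)
```

Averaging raw correlations is the simplest choice; averaging Fisher z-transformed values is a common alternative, and the two can differ slightly.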

Paper Structure (22 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Correlations between inferred and self-reported personality trait scores across conditions. ChatGPT conditions (assessment, acquaintance, assistant) are shown in different columns. User conditions (conversation, unconstrained) are shown across rows. The vertical black lines represent two-tailed 95% confidence intervals. The horizontal black lines represent the lower bounds of one-tailed 95% confidence intervals.
  • Figure 2: User experience ratings across items (The conversation was natural; The conversation was pleasant; The conversation was engaging; My conversation partner asked good questions; My conversation partner gave good answers; My conversation partner was humanlike) and conditions. The vertical black lines represent two-sided 95% confidence intervals.
  • Figure 3: Group differences in residuals (left) and correlations (right) across demographic groups. Residuals were computed as inferred minus self-reported personality scores, so a negative value indicates that inferred scores are biased downward. A larger bar indicates larger residuals for the specified group. Correlations were computed as Pearson’s correlation coefficients between inferred and self-reported personality scores within each group. A larger bar indicates higher accuracy for the specified group.
  • ...and 3 more figures (captions not available)
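
The two quantities described in the Figure 3 caption, residuals (inferred minus self-reported) and within-group Pearson correlations, can be sketched as follows. This is a rough illustration and not the authors' analysis code. It assumes residuals are summarized as a per-group mean, which the caption does not state. The function name `group_bias_and_accuracy`, the group labels, and the scores are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def group_bias_and_accuracy(records):
    """Map each group to (mean residual, within-group Pearson r).

    `records` is an iterable of (group, inferred, self_reported) tuples.
    A negative mean residual means inferred scores sit below self-reports.
    Using the per-group mean is an assumption of this sketch.
    """
    by_group = defaultdict(list)
    for group, inferred, reported in records:
        by_group[group].append((inferred, reported))

    result = {}
    for group, pairs in by_group.items():
        inferred = [p[0] for p in pairs]
        reported = [p[1] for p in pairs]
        mean_residual = sum(i - r for i, r in pairs) / len(pairs)
        result[group] = (mean_residual, pearson_r(inferred, reported))
    return result


# Hypothetical records for two made-up groups.
records = [
    ("group_a", 2.0, 3.0), ("group_a", 3.0, 4.0), ("group_a", 4.0, 5.0),
    ("group_b", 3.0, 2.0), ("group_b", 1.0, 2.0), ("group_b", 5.0, 4.0),
]

for group, (bias, r) in group_bias_and_accuracy(records).items():
    print(f"{group}: mean residual = {bias:+.3f}, r = {r:.3f}")
```

Keeping the two measures separate matters: a group can show a strong correlation (accurate rank ordering) while its inferred scores are still systematically shifted up or down.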