
Large Language Models Can Infer Psychological Dispositions of Social Media Users

Heinrich Peters, Sandra Matz

TL;DR

The study investigates whether LLMs (GPT-3.5 and GPT-4), used zero-shot, can infer the Big Five personality traits from Facebook status updates, and how accuracy varies by age and gender. Using 1,000 MyPersonality users, each with 200 recent status updates, the authors compare self-reported IPIP scores with LLM-derived trait scores and find overall correlations of $r_{\text{GPT-3.5}} = .27$ and $r_{\text{GPT-4}} = .31$, with Openness, Extraversion, and Agreeableness being most detectable. Demographic analyses reveal gender and age biases, with women generally yielding more accurate inferences and older users showing mixed or weaker signals, though within-group correlations remain comparable. Agreement with third-party observer ratings indicates that LLM inferences are broadly similar in quality to human judgments, underscoring both the potential and the ethical challenges of automated psychometrics. The authors call for governance, privacy safeguards, and further work to unpack the cues and mechanisms behind these zero-shot inferences, as well as to improve accuracy on less-inferable traits.

Abstract

Large Language Models (LLMs) demonstrate increasingly human-like abilities across a wide variety of tasks. In this paper, we investigate whether LLMs like ChatGPT can accurately infer the psychological dispositions of social media users and whether their ability to do so varies across socio-demographic groups. Specifically, we test whether GPT-3.5 and GPT-4 can derive the Big Five personality traits from users' Facebook status updates in a zero-shot learning scenario. Our results show an average correlation of r = .29 (range = [.22, .33]) between LLM-inferred and self-reported trait scores - a level of accuracy that is similar to that of supervised machine learning models specifically trained to infer personality. Our findings also highlight heterogeneity in the accuracy of personality inferences across different age groups and gender categories: predictions were found to be more accurate for women and younger individuals on several traits, suggesting a potential bias stemming from the underlying training data or differences in online self-expression. The ability of LLMs to infer psychological dispositions from user-generated text has the potential to democratize access to cheap and scalable psychometric assessments for both researchers and practitioners. On the one hand, this democratization might facilitate large-scale research of high ecological validity and spark innovation in personalized services. On the other hand, it also raises ethical concerns regarding user privacy and self-determination, highlighting the need for stringent ethical frameworks and regulation.

Paper Structure (15 sections, 4 figures)

Figures (4)

  • Figure 1: Distributions of self-reported and inferred personality scores for GPT-3.5 and GPT-4. Histograms show absolute frequencies for an overall sample size of n=1000. GPT-3.5 underestimates Openness. Both models underestimate Conscientiousness and Agreeableness but overestimate Neuroticism. For Extraversion, the two models diverge, with GPT-3.5 underestimating and GPT-4 overestimating the self-reported scores. Overall, the scores inferred by GPT-4 were more closely aligned with self-reported scores, indicating a potential improvement over GPT-3.5.
  • Figure 2: Pearson’s correlation coefficients between inferred and self-reported scores with 95% confidence intervals (left), and Pearson’s correlation coefficients as a function of message volume for GPT-3.5 (middle) and GPT-4 (right). O: Openness; C: Conscientiousness; E: Extraversion; A: Agreeableness; N: Neuroticism. Inferences for Openness, Extraversion, and Agreeableness were more accurate than those for Conscientiousness and Neuroticism, but the differences remained non-significant. Higher message volume was associated with higher levels of predictive accuracy, but a substantial share of variance was captured in as few as 20 status messages.
  • Figure 3: Mean differences in personality scores between gender groups (left) and age groups (right) for self-reported scores as well as inferences by GPT-3.5 and GPT-4. Positive values indicate higher scores for female users compared to male users and older users compared to younger users. O: Openness; C: Conscientiousness; E: Extraversion; A: Agreeableness; N: Neuroticism. ***p<.001; **p<.01; *p<.05. The results show significant gender and age differences across all personality traits.
  • Figure 4: Mean differences in absolute residuals between gender groups (left) and age groups (right) for inferences by GPT-3.5 and GPT-4. Positive values indicate higher residuals for female users compared to male users and older users compared to younger users. O: Openness; C: Conscientiousness; E: Extraversion; A: Agreeableness; N: Neuroticism. ***p<.001; **p<.01; *p<.05. The results indicate lower residuals for female users in all personality traits except Extraversion. Age-related biases were observed for Openness, Conscientiousness, and Agreeableness in inferences by GPT-3.5 but not GPT-4.
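
The two quantities the figure captions rely on, a Pearson correlation with a 95% confidence interval (Figure 2) and a between-group difference in mean absolute residuals (Figure 4), can be illustrated with a minimal sketch. This is not the authors' analysis code. The function names are hypothetical, the interval uses the standard Fisher z-transformation (the paper may compute it differently), and the residual is assumed here to be simply the inferred score minus the self-reported score.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r via the Fisher z-transformation (assumed method)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def abs_residual_gap(self_report, inferred, in_group):
    """Mean absolute residual of the flagged group minus that of the other group.

    Assumption: residual = inferred score - self-reported score.
    A negative result means inferences are closer for the flagged group.
    """
    flagged = [abs(i - s) for s, i, g in zip(self_report, inferred, in_group) if g]
    others = [abs(i - s) for s, i, g in zip(self_report, inferred, in_group) if not g]
    return mean(flagged) - mean(others)


if __name__ == "__main__":
    # Interval around the average correlation reported in the abstract (n = 1000).
    low, high = fisher_ci(0.29, 1000)
    print(f"r = .29, 95% CI approx. [{low:.2f}, {high:.2f}]")
```

In the convention of Figure 4, `in_group` would flag female users (or older users), so a negative gap corresponds to lower residuals for that group.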