Personality Profiling: How informative are social media profiles in predicting personal information?

Joshua Watt, Lewis Mitchell, Jonathan Tuke

TL;DR

The extent to which people's online digital footprints can be used to profile their Myers-Briggs personality type is explored, and four models are analysed: logistic regression, naive Bayes, support vector machines (SVMs) and random forests.

Abstract

Personality profiling has been utilised by companies for targeted advertising, political campaigns and public health campaigns. However, the accuracy and versatility of such models remain relatively unknown. Here we explore the extent to which people's online digital footprints can be used to profile their Myers-Briggs personality type. We analyse and compare four models: logistic regression, naive Bayes, support vector machines (SVMs) and random forests. We find that an SVM model achieves the best accuracy, 20.95%, for predicting a complete personality type. However, logistic regression models perform only marginally worse and are significantly faster both to train and to make predictions. Moreover, we develop a statistical framework for assessing the importance of different sets of features in our models. We find some features to be more informative than others in the Intuitive/Sensory (p = 0.032) and Thinking/Feeling (p = 0.019) models. Many labelled datasets of personal characteristics on social media, including our own, exhibit substantial class imbalances. We therefore highlight the need for careful consideration when reporting model performance on such datasets, and we compare a number of methods for addressing class imbalance.
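To illustrate why reported performance needs care on imbalanced data, the following is a minimal sketch in plain Python, not the paper's evaluation code. The 80/20 label split is made up for illustration, and the helper names `accuracy` and `balanced_accuracy` are hypothetical. It shows that a trivial model that always predicts the majority class can look strong on raw accuracy while being no better than chance on a class-balanced metric. It also prints the uniform-guessing rate for a complete type, which follows from four binary dichotomies giving 16 types.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical, made-up labels for one dichotomy: 80 "N" and 20 "S".
# This split is an assumption for illustration, not the paper's data.
y_true = ["N"] * 80 + ["S"] * 20

# A trivial "model" that always predicts the majority class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5

# Four binary dichotomies give 2**4 = 16 complete types, so uniform
# random guessing would be correct 1/16 of the time.
print(1 / 2**4)  # 0.0625
```

Under uniform guessing, 6.25% is the reference point for a complete-type accuracy such as 20.95%. With imbalanced classes, a majority-class guess can exceed that rate, so a majority baseline is usually the more informative comparison.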

Paper Structure

This paper contains 7 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Proportion of accounts displaying each dichotomous trait in our dataset, on Twitter and in the general population.
  • Figure 2: Confusion matrices for modelling the N/S dichotomy.
  • Figure 3: Variable Importance Plots for an upsampled LR model for each dichotomy. Variables sorted by the absolute value of variable importance. Bars coloured by feature preference for each class.
  • Figure 4: Variable Importance Plots for emoji counts in the upsampled LR models. Variables sorted by absolute value of variable importance. We colour bars by the feature preference for each class.
  • Figure 5: Word clouds of tweets/quotes containing specific emojis in our dataset: rocket ship (left) and red heart (right). Note that we remove stopwords as they do not provide much context for the tweets.
  • ...and 1 more figure
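The captions for Figures 3 and 4 refer to upsampled LR models. The exact resampling procedure is not given in this excerpt, so the following is only a sketch of one common approach, random upsampling with replacement. The function name `upsample`, the `seed` parameter and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter


def upsample(rows, labels, seed=0):
    """Resample minority classes with replacement until every class
    has as many examples as the largest class."""
    rng = random.Random(seed)

    # Group feature rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw extra copies at random to reach the target class size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (hypothetical): 8 examples of "N" and 2 of "S".
rows = [[i] for i in range(10)]
labels = ["N"] * 8 + ["S"] * 2

new_rows, new_labels = upsample(rows, labels)
print(Counter(new_labels))
```

Resampling of this kind is normally applied to the training split only, so that evaluation data keeps its original class proportions.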