Is Personality Prediction Possible Based on Reddit Comments?

Robert Deimann; Till Preidt; Shaptarshi Roy; Jan Stanicki

TL;DR

The paper investigates whether MBTI personality types can be inferred from Reddit comments using ALBERT-based sequence classifiers trained on a large, MBTI-labeled Reddit dataset. It systematically compares 16-type, 8-function, and 4-binary-axis labeling schemes across multiple sampling regimes (balanced vs. proportional) and data partitions, and reports that balanced training improves performance, although overall accuracy remains limited for fine-grained types. Key contributions include a detailed data-collection pipeline that masks explicit type mentions, a comprehensive evaluation of multiple labeling schemes, and a call for context-aware modeling and larger-scale data. The work finds a measurable, albeit imperfect, signal of personality in text and provides public resources (models and code) to facilitate further research in personality prediction from language.
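
The masking step and the 4-binary-axis labeling scheme mentioned above can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the `<type>` placeholder, and the axis keys are hypothetical choices made for illustration, and the regular expression only covers the bare four-letter type codes.

```python
import re

# The 16 MBTI type codes, built from the four binary axes.
MBTI_TYPES = [a + b + c + d for a in "IE" for b in "NS" for c in "TF" for d in "JP"]

# Matches any four-letter type code as a whole word, ignoring case.
# Assumption: plurals, nicknames, and other indirect mentions are not handled.
TYPE_PATTERN = re.compile(r"\b[IE][NS][TF][JP]\b", re.IGNORECASE)


def mask_type_mentions(text, placeholder="<type>"):
    """Replace explicit type codes so a classifier cannot read the label from the text."""
    return TYPE_PATTERN.sub(placeholder, text)


def to_binary_axes(mbti_type):
    """Split a 16-type label into its four binary axis labels."""
    code = mbti_type.upper()
    if code not in MBTI_TYPES:
        raise ValueError(f"not an MBTI type code: {mbti_type!r}")
    return {"I/E": code[0], "N/S": code[1], "T/F": code[2], "J/P": code[3]}


if __name__ == "__main__":
    print(mask_type_mentions("As an INTP, I think intj is fine."))
    print(to_binary_axes("intp"))
```

Under the 4-binary-axis scheme, each axis becomes a separate two-class problem, which gives every class far more training examples than the 16-way setup.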

Abstract

In this assignment, we examine whether a person's personality type correlates with the texts they write. To do this, we aggregated datasets of Reddit comments labeled with the author's Myers-Briggs Type Indicator (MBTI) type and built several BERT-based supervised classifiers that try to predict an author's personality from a given text. Despite issues caused by the unfiltered nature of the dataset, the classification results show potential.

Paper Structure (16 sections, 9 figures, 1 table)

Team roles and responsibility
Introduction
Related Work
Myers Briggs Type Indicator
Data Collection
Limitations of the dataset
Sampling
Method
Evaluation
Analysis
Qualitative Analysis with a Bag of Words approach
Classification Analysis
Future Work
Conclusion
Code
...and 1 more section

Figures (9)

Figure 1: Distribution of users given their label. Red: Number of authors from /r/MBTI/. Blue: Number of authors with additional collection from specific class-subreddits.
Figure 2: Distribution of classes in the proportionate samples. Blue: Total sample. Green: sample without MBTI-subreddits. Orange: MBTI-only subreddits.
Figure 3: Confusion matrix in early training with imbalanced dataset
Figure 4: Distribution of languages for the class INTP in descending order.
Figure 5: Distribution of languages for the class INTP in descending order.
...and 4 more figures
