Is Personality Prediction Possible Based on Reddit Comments?

Robert Deimann, Till Preidt, Shaptarshi Roy, Jan Stanicki

TL;DR

The paper investigates whether MBTI personality types can be inferred from Reddit comments using ALBERT-based sequence classifiers trained on a large, MBTI-labeled Reddit dataset. It systematically compares 16-type, 8-function, and 4-binary-axis labeling schemes across multiple sampling regimes (balanced vs proportional) and data partitions, reporting that balanced training improves performance though overall accuracy remains limited for fine-grained types. Key contributions include a detailed data-collection pipeline with masking of explicit type mentions, a comprehensive evaluation of multiple labeling schemes, and a call for context-aware modeling and larger-scale data. The work demonstrates a measurable, albeit imperfect, signal that text contains learnable personality cues and provides public resources (models and code) to facilitate further research in personality prediction from language.
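As a rough illustration of two steps named in this summary, the sketch below masks explicit type mentions in a comment and converts a 16-type label into the 4-binary-axis scheme. It is not the authors' pipeline: the mask token `<type>`, the regular expression, the 0/1 encoding, and the function names `mask_type_mentions` and `to_binary_axes` are all assumptions made for this example.

```python
import re

# The four MBTI axes; the first letter of each pair is encoded as 0, the second as 1.
# This encoding is an assumption for illustration, not taken from the paper.
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]

# All 16 type codes, built by picking one letter per axis.
MBTI_TYPES = [a + b + c + d for a in "IE" for b in "NS" for c in "TF" for d in "JP"]

# Hypothetical mask token; the paper's actual replacement string is not given here.
MASK = "<type>"

# Match any type code as a whole word, optionally pluralised, in any letter case.
_TYPE_RE = re.compile(r"\b(?:" + "|".join(MBTI_TYPES) + r")s?\b", re.IGNORECASE)


def mask_type_mentions(text: str) -> str:
    """Replace explicit MBTI type mentions so a classifier cannot read the label."""
    return _TYPE_RE.sub(MASK, text)


def to_binary_axes(mbti: str) -> list[int]:
    """Map a 16-type label such as 'INTP' to four binary axis labels."""
    mbti = mbti.upper()
    if mbti not in MBTI_TYPES:
        raise ValueError(f"not an MBTI type: {mbti!r}")
    return [int(mbti[i] == second) for i, (_, second) in enumerate(AXES)]


if __name__ == "__main__":
    print(mask_type_mentions("As an INTP I disagree with intjs."))
    print(to_binary_axes("INTP"))
```

Under the 4-axis scheme each axis can be trained as a separate binary classifier, which gives every model far more examples per class than a single 16-way classifier receives.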

Abstract

In this assignment, we examine whether there is a correlation between the personality type of a person and the texts they wrote. In order to do this, we aggregated datasets of Reddit comments labeled with the Myers-Briggs Type Indicator (MBTI) of the author and built different supervised classifiers based on BERT to try to predict the personality of an author given a text. Despite experiencing issues with the unfiltered character of the dataset, we can observe potential in the classification.

Paper Structure (16 sections, 9 figures, 1 table)

Figures (9)

  • Figure 1: Distribution of users given their label. Red: Number of authors from /r/MBTI/. Blue: Number of authors with additional collection from specific class-subreddits.
  • Figure 2: Distribution of classes in the proportionate samples. Blue: Total sample. Green: sample without MBTI-subreddits. Orange: MBTI-only subreddits.
  • Figure 3: Confusion matrix in early training with imbalanced dataset
  • Figure 4: Distribution of languages for the class INTP in descending order.
  • Figure 5: Distribution of languages for the class INTP in descending order.
  • ...and 4 more figures