Is Personality Prediction Possible Based on Reddit Comments?
Robert Deimann, Till Preidt, Shaptarshi Roy, Jan Stanicki
TL;DR
The paper investigates whether MBTI personality types can be inferred from Reddit comments using ALBERT-based sequence classifiers trained on a large, MBTI-labeled Reddit dataset. It systematically compares three labeling schemes (16 types, 8 functions, and 4 binary axes) across multiple sampling regimes (balanced vs. proportional) and data partitions. Balanced training improves performance, although overall accuracy remains limited for the fine-grained types. Key contributions include a detailed data-collection pipeline that masks explicit type mentions, a comprehensive evaluation of the labeling schemes, and a call for context-aware modeling and larger-scale data. The results indicate a measurable, albeit imperfect, signal: text contains learnable personality cues. The authors provide public resources (models and code) to facilitate further research in personality prediction from language.
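To make the preprocessing ideas in the summary concrete, the following is a minimal sketch, not the authors' actual pipeline. It assumes that type mentions are masked by a case-insensitive regular expression over the 16 type codes, that the 4-binary-axis labels are read directly from the four letters of a type, and that "balanced" sampling means downsampling every class to the size of the smallest one. The mask token `<type>` and all function names are hypothetical.

```python
import random
import re
from collections import defaultdict

# The 16 MBTI type codes, built from the four binary axes.
MBTI_TYPES = [a + b + c + d for a in "IE" for b in "NS" for c in "TF" for d in "JP"]

# Hypothetical placeholder; the paper's actual mask token is not given here.
MASK_TOKEN = "<type>"

# Matches a type code as a whole word, optionally pluralized (e.g. "ENFPs").
_TYPE_PATTERN = re.compile(r"\b(?:" + "|".join(MBTI_TYPES) + r")s?\b", re.IGNORECASE)


def mask_type_mentions(text):
    """Replace explicit MBTI type mentions so a classifier cannot read the label."""
    return _TYPE_PATTERN.sub(MASK_TOKEN, text)


def to_binary_axes(mbti_type):
    """Split a 16-type label into four binary-axis labels."""
    t = mbti_type.upper()
    if t not in MBTI_TYPES:
        raise ValueError(f"not an MBTI type: {mbti_type!r}")
    return {"I/E": t[0], "N/S": t[1], "T/F": t[2], "J/P": t[3]}


def balanced_sample(examples, seed=0):
    """Downsample (text, label) pairs so every label has the smallest class size.

    Assumption: balancing by downsampling; other balancing strategies exist.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    n = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], n))
    return sample
```

For example, `mask_type_mentions("As an INTJ, I disagree with most ENFPs.")` yields `"As an <type>, I disagree with most <type>."`, and `to_binary_axes("infp")` yields one letter per axis. A proportional regime would skip `balanced_sample` and keep the original class distribution.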
Abstract
In this assignment, we examine whether there is a correlation between a person's personality type and the texts they write. To do this, we aggregated datasets of Reddit comments labeled with the author's Myers-Briggs Type Indicator (MBTI) and built several supervised BERT-based classifiers that try to predict an author's personality from a given text. Despite issues caused by the unfiltered nature of the dataset, the classification results show potential.
