Uncovering the Sociodemographic Fabric of Reddit

Federico Cinus, Corrado Monti, Paolo Bajardi, Gianmarco De Francisci Morales

TL;DR

This paper tackles the challenge of inferring sociodemographic attributes on Reddit with transparency and reliability. It introduces a principled Bayesian framework trained on over 850k self-declarations, demonstrating that simple, interpretable models (notably Multinomial Naive Bayes) can outperform embedding-based baselines in both classification and quantification tasks, with ROC AUC gains up to 19% and mean absolute error below 15% in large-scale settings. The approach supports calibrated uncertainty, feature-level interpretability, and population-level analyses, enabling auditable investigations of how identity shapes participation across 1400 subreddits. By grounding demographic inference in user-provided self-declarations rather than researcher assumptions, the work advances ethical, scalable computational social science and provides a practical blueprint for robust demographic analysis on large online platforms.

Abstract

Understanding the sociodemographic composition of online platforms is essential for accurately interpreting digital behavior and its societal implications. Yet, current methods often lack the transparency and reliability required, risking misrepresenting social identities and distorting our understanding of digital society. Here, we introduce a principled framework for sociodemographic inference on Reddit that leverages over 850,000 user self-declarations of age, gender, and partisan affiliation. By training models on sparse user activity signals from this extensive, self-disclosed dataset, we demonstrate that simple probabilistic models, such as Naive Bayes, outperform more complex embedding-based alternatives. Our approach improves classification performance over the state of the art by up to 19% in ROC AUC and maintains quantification error below 15%. The models produce well-calibrated and interpretable outputs, enabling uncertainty estimation and subreddit-level feature importance analysis. More broadly, this work advocates for a shift toward more ethical and transparent computational social science by grounding sociodemographic analysis in user-provided data rather than researcher assumptions.
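
To make the modeling idea concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier over per-user subreddit activity counts, followed by a Classify & Count prevalence estimate. It is an illustration under stated assumptions, not the authors' implementation: the subreddit names (`sub_a`, `sub_b`, `sub_c`), the class labels (`x`, `y`), the toy counts, and the smoothing constant `alpha=1.0` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(users, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on per-user subreddit counts.

    users  : list of dicts mapping subreddit name -> activity count
    labels : list of class labels, one per user
    alpha  : additive (Laplace) smoothing constant, an assumed default
    """
    vocab = sorted({s for u in users for s in u})
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)
    for u, y in zip(users, labels):
        feature_counts[y].update(u)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(feature_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {
            s: math.log((feature_counts[c][s] + alpha) / total) for s in vocab
        }
    return log_prior, log_lik


def predict_proba(model, user):
    """Return posterior class probabilities for one user's activity counts."""
    log_prior, log_lik = model
    scores = {}
    for c in log_prior:
        # Subreddits unseen during training are ignored.
        scores[c] = log_prior[c] + sum(
            n * log_lik[c][s] for s, n in user.items() if s in log_lik[c]
        )
    # Normalize in a numerically stable way (subtract the max log score).
    m = max(scores.values())
    exp = {c: math.exp(v - m) for c, v in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def classify_and_count(model, users):
    """Estimate class prevalence as the share of users assigned to each class."""
    preds = []
    for u in users:
        p = predict_proba(model, u)
        preds.append(max(p, key=p.get))
    counts = Counter(preds)
    return {c: counts[c] / len(users) for c in model[0]}


# Hypothetical toy data: two users per class.
train_users = [
    {"sub_a": 5, "sub_c": 1},
    {"sub_a": 3},
    {"sub_b": 4, "sub_c": 1},
    {"sub_b": 6},
]
train_labels = ["x", "x", "y", "y"]

model = train_multinomial_nb(train_users, train_labels)
population = [{"sub_a": 2}, {"sub_b": 1}, {"sub_a": 1, "sub_c": 1}]
print(predict_proba(model, {"sub_a": 2}))
print(classify_and_count(model, population))
```

Because the per-class log-likelihoods are stored per subreddit, a model of this shape can be inspected feature by feature, which is the kind of subreddit-level interpretability the abstract refers to.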

Paper Structure

This paper contains 24 sections, 2 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Sociodemographic Fabric of Reddit: 2D projection of subreddits colored by sociodemographic attributes. We visualize a network of co-participation patterns in 2019 Reddit data among 1,400 subreddits, chosen by popularity. Posts and comments were sampled at 5% and filtered to exclude deleted content and inactive users. We retained users with at least 100 contributions and classified them by age, gender, and partisanship using pre-trained Naive Bayes models with a Classify & Count quantifier. Subreddit positions are computed via t-SNE on a PPMI-transformed co-occurrence matrix. Colors indicate the share of users in each subreddit predicted to be (a) older than the median age, (b) female, or (c) Republican-leaning. Marker size reflects subreddit activity, and labels represent descriptive summaries of the subreddits in each of 20 KMeans clusters, generated using GPT-4o. An interactive version of the plot is available at https://federicocinus.github.io/reddit-fabric.
  • Figure 2: ROC curves for each attribute (Year of Birth, Gender, Partisan Affiliation) and model (Naive Bayes (NB), Waller and Anderson 2021 (WA)). Models are trained with true supervision, use random oversampling for class imbalance, and are evaluated via 10-fold stratified cross-validation.
  • Figure 3: ROC curves for each attribute (Year of Birth, Gender, Partisan Affiliation) and model (Naive Bayes (NB), Waller and Anderson 2021 (WA)). Models are trained with distant supervision, use random oversampling for class imbalance, and are evaluated via 10-fold stratified cross-validation.
  • Figure 4: Quantification curves. MAE obtained by each method vs. the number of training samples with true declared labels (true supervision).
  • Figure 5: Calibration curves for the different attributes, showing the alignment of prediction scores with true probabilities.
  • ...and 5 more figures
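
The Figure 1 caption mentions a PPMI-transformed co-occurrence matrix as the input to t-SNE. As a hedged illustration of that transformation only, the sketch below computes positive pointwise mutual information from a small matrix of co-occurrence counts. The natural-log base, the convention of mapping zero counts to 0, and the toy matrix are assumptions; the caption does not specify these details, and the t-SNE and KMeans steps are not shown.

```python
import math


def ppmi(cooc):
    """Positive pointwise mutual information of a co-occurrence count matrix.

    cooc : list of lists of nonnegative counts, where cooc[i][j] is how often
           items i and j co-occur (for example, users shared by two subreddits).
    """
    total = sum(sum(row) for row in cooc)
    row_sums = [sum(row) for row in cooc]
    col_sums = [sum(col) for col in zip(*cooc)]

    out = []
    for i, row in enumerate(cooc):
        out_row = []
        for j, n in enumerate(row):
            if n == 0:
                # PMI is undefined (minus infinity) here; PPMI maps it to 0.
                out_row.append(0.0)
                continue
            # PMI = log( P(i, j) / (P(i) * P(j)) )
            pmi = math.log((n * total) / (row_sums[i] * col_sums[j]))
            # Keep only positive associations.
            out_row.append(max(pmi, 0.0))
        out.append(out_row)
    return out


# Hypothetical 2x2 example: strong within-item counts, weak cross counts.
toy = [[10, 1], [1, 10]]
for row in ppmi(toy):
    print(row)
```

In the toy matrix the off-diagonal pairs co-occur less often than chance would predict, so their PMI is negative and PPMI clips it to zero, while the diagonal entries keep a positive score.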