Uncovering the Sociodemographic Fabric of Reddit
Federico Cinus, Corrado Monti, Paolo Bajardi, Gianmarco De Francisci Morales
TL;DR
This paper tackles the challenge of inferring sociodemographic attributes on Reddit transparently and reliably. It introduces a principled Bayesian framework trained on over 850,000 self-declarations and shows that simple, interpretable models (notably Multinomial Naive Bayes) can outperform embedding-based baselines in both classification and quantification, with ROC AUC gains of up to 19% and mean absolute error below 15% in large-scale settings. The approach supports calibrated uncertainty estimates, feature-level interpretability, and population-level analyses, enabling auditable investigations of how identity shapes participation across 1,400 subreddits. By grounding demographic inference in user-provided self-declarations rather than researcher assumptions, the work advances ethical, scalable computational social science and provides a practical blueprint for robust demographic analysis on large online platforms.
Abstract
Understanding the sociodemographic composition of online platforms is essential for accurately interpreting digital behavior and its societal implications. Yet current methods often lack the required transparency and reliability, which risks misrepresenting social identities and distorting our understanding of digital society. Here, we introduce a principled framework for sociodemographic inference on Reddit that leverages over 850,000 user self-declarations of age, gender, and partisan affiliation. By training models on sparse user activity signals from this extensive, self-disclosed dataset, we demonstrate that simple probabilistic models, such as Naive Bayes, outperform more complex embedding-based alternatives. Our approach improves classification performance over the state of the art by up to 19% in ROC AUC and maintains quantification error below 15%. The models produce well-calibrated and interpretable outputs, enabling uncertainty estimation and subreddit-level feature importance analysis. More broadly, this work advocates for a shift toward more ethical and transparent computational social science by grounding sociodemographic analysis in user-provided data rather than researcher assumptions.
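
To make the idea concrete, the following is a minimal sketch of a Multinomial Naive Bayes classifier over per-user subreddit activity counts, written in plain Python. It is not the authors' implementation. The toy data, the subreddit names (`sub_a` to `sub_e`), the class labels (`young`, `old`), the smoothing value `alpha=1.0`, and all function names are assumptions made for illustration. The prevalence estimate shown here simply averages posterior probabilities, which is one common quantification approach and may differ from the method used in the paper.

```python
import math
from collections import Counter, defaultdict


def fit_multinomial_nb(users, labels, alpha=1.0):
    """Fit Multinomial Naive Bayes with additive (Laplace) smoothing.

    users  : list of dicts mapping subreddit name -> activity count
    labels : list of self-declared class labels, one per user
    """
    vocab = sorted({sub for user in users for sub in user})
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)
    for user, label in zip(users, labels):
        feature_counts[label].update(user)

    n_users = len(labels)
    log_prior = {c: math.log(k / n_users) for c, k in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(feature_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {
            sub: math.log((feature_counts[c][sub] + alpha) / total)
            for sub in vocab
        }
    return log_prior, log_lik


def predict_proba(model, user):
    """Posterior class probabilities for one user (unseen subreddits are skipped)."""
    log_prior, log_lik = model
    scores = {
        c: log_prior[c]
        + sum(cnt * log_lik[c][sub] for sub, cnt in user.items() if sub in log_lik[c])
        for c in log_prior
    }
    top = max(scores.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def log_odds(model, class_a, class_b):
    """Per-subreddit log-likelihood ratio: positive values favor class_a."""
    _, log_lik = model
    return {sub: log_lik[class_a][sub] - log_lik[class_b][sub] for sub in log_lik[class_a]}


def estimate_prevalence(model, users):
    """Population-level class shares, estimated as the mean posterior per class."""
    totals = Counter()
    for user in users:
        for c, p in predict_proba(model, user).items():
            totals[c] += p
    return {c: t / len(users) for c, t in totals.items()}


# Toy, made-up training data: four users with self-declared labels.
users = [
    {"sub_a": 5, "sub_b": 2},
    {"sub_a": 3, "sub_c": 4},
    {"sub_d": 4, "sub_e": 3},
    {"sub_d": 2, "sub_b": 1},
]
labels = ["young", "young", "old", "old"]

model = fit_multinomial_nb(users, labels)
posterior = predict_proba(model, {"sub_a": 3})
importance = log_odds(model, "young", "old")
prevalence = estimate_prevalence(model, users)
```

In this sketch, `posterior` carries the per-user uncertainty, `importance` gives a subreddit-level view of which communities push a prediction toward one class, and `prevalence` aggregates posteriors into a population-level estimate. For real data, a library implementation such as scikit-learn's `MultinomialNB` on a sparse user-by-subreddit count matrix would be the more practical choice.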
