MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data

Saad Mankarious, Ayah Zirikly, Daniel Wiechmann, Elma Kerz, Edward Kempa, Yu Qiao

TL;DR

MindSET introduces a large-scale, rigorously cleaned Reddit-based benchmark for mental health analysis, addressing the noise, multilingual content, and limited data accessibility of prior resources. Diagnosed users are identified via high-precision self-diagnosis patterns and paired with behaviorally matched controls. The benchmark covers seven diagnosed conditions, with robust cleaning (language filtering, deduplication, NSFW removal) and standardized splits. LIWC-based analyses reveal clear psycholinguistic markers distinguishing diagnosed from control users, while transformer- and BoW-based classifiers trained on MindSET outperform those trained on prior benchmarks, with gains of up to 18 F1 points for Autism. The dataset supports scalable, interpretable, and ethically guided mental health NLP research, with implications for early risk detection, longitudinal analysis, and cross-platform validation.

Abstract

Social media data has become a vital resource for studying mental health, offering real-time insights into thoughts, emotions, and behaviors that traditional methods often miss. Progress in this area has been facilitated by benchmark datasets for mental health analysis; however, most existing benchmarks have become outdated due to limited data availability, inadequate cleaning, and the inherently diverse nature of social media content (e.g., multilingual and harmful material). We present a new benchmark dataset, MindSET, curated from Reddit using self-reported diagnoses to address these limitations. The dataset contains over 13M annotated posts across seven mental health conditions, more than twice the size of previous benchmarks. To ensure data quality, we applied rigorous preprocessing steps, including language filtering and the removal of Not Safe for Work (NSFW) and duplicate content. We further performed a linguistic analysis using LIWC to examine psychological term frequencies across the eight groups represented in the dataset (the seven conditions plus controls). To demonstrate the dataset's utility, we conducted binary classification experiments for diagnosis detection using both fine-tuned language models and Bag-of-Words (BoW) features. Models trained on MindSET consistently outperformed those trained on previous benchmarks, achieving up to an 18-point improvement in F1 for Autism detection. Overall, MindSET provides a robust foundation for researchers exploring the intersection of social media and mental health, supporting both early risk detection and deeper analysis of emerging psychological trends.
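
As an illustration of two of the preprocessing steps named in the abstract (NSFW removal and duplicate removal), the following is a minimal sketch rather than the authors' pipeline. The record layout (a dict with `text` and `nsfw` keys) and the function names are assumptions made for this example, duplicates are detected only by exact match after whitespace and case normalization, and language filtering is left out because it would require a language-identification model.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def clean_posts(posts):
    """Drop NSFW-flagged posts and exact duplicates (after normalization).

    Assumption: each post is a dict with hypothetical keys
    'text' (str) and 'nsfw' (bool, treated as False when missing).
    """
    seen = set()
    kept = []
    for post in posts:
        if post.get("nsfw", False):
            continue  # NSFW removal
        digest = hashlib.sha1(normalize(post["text"]).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate removal
        seen.add(digest)
        kept.append(post)
    return kept
```

The first occurrence of each duplicated post is kept, so the output preserves the input order.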

Paper Structure

This paper contains 22 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Step-by-step end-to-end pipeline for dataset construction. The output of each stage is shown in the blue rectangles below.
  • Figure 2: Distinction between confirmed self-reported diagnoses (Statement 1) and tentative or uncertain expressions (Statement 2). Only patterns resembling Statement 1 were used.
  • Figure 3: Cleaning pipeline. Each step shows the percentage decrease in data volume. Example content filtered at each stage is displayed, except for the middle step, which contains inappropriate material.
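
The distinction described in the Figure 2 caption, between confirmed self-reported diagnoses and tentative expressions, can be sketched with regular expressions. This is a hedged illustration, not the patterns used to build MindSET: the condition names, the hedge phrases, and the function name `confirmed_diagnosis` are all placeholders chosen for this example.

```python
import re

# Placeholder condition names for illustration; not the paper's list.
CONDITIONS = r"(?:autism|depression)"

# Confirmed statements: first-person, explicit "diagnosed with <condition>".
# Only a few neutral adverbs are allowed before "diagnosed", so negations
# such as "never diagnosed" do not match.
CONFIRMED = re.compile(
    r"\bi (?:was|am|have been|got) "
    r"(?:(?:recently|officially|formally|finally|just) )?"
    r"diagnosed with (?P<condition>" + CONDITIONS + r")\b",
    re.IGNORECASE,
)

# Tentative or uncertain wording; any hit rejects the whole text.
HEDGES = re.compile(
    r"\b(?:i think|maybe|might|probably|not sure|self[- ]diagnosed|wonder if)\b",
    re.IGNORECASE,
)


def confirmed_diagnosis(text: str):
    """Return the matched condition in lowercase, or None if not confirmed."""
    if HEDGES.search(text):
        return None
    match = CONFIRMED.search(text)
    return match.group("condition").lower() if match else None
```

Rejecting any text that contains a hedge phrase favors precision over recall: some genuine diagnoses are missed, but uncertain statements are less likely to be labeled as diagnosed.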