Table of Contents
Fetching ...

CARMA: Comprehensive Automatically-annotated Reddit Mental Health Dataset for Arabic

Saad Mankarious, Ayah Zirikly

TL;DR

CARMA presents the first large-scale, automatically annotated Arabic mental health dataset derived from Reddit, spanning six conditions and a control group. It combines self-reported diagnosis labeling with rigorous cleaning and dialect filtering to yield a robust resource (~340k diagnosed posts; ~1M control posts) for lexical analysis and model benchmarking. The work demonstrates distinct condition-specific linguistic signals, validates multiple modeling approaches (hybrid pipelines and end-to-end transformer fine-tuning), and establishes strong baselines that advance mental health detection in underrepresented languages. By showcasing scalability, diversity, and actionable linguistic insights, CARMA offers a valuable foundation for early detection, cross-linguistic NLP research, and responsible deployment in Arabic-speaking contexts.

Abstract

Mental health disorders affect millions worldwide, yet early detection remains a major challenge, particularly for Arabic-speaking populations where resources are limited and mental health discourse is often discouraged due to cultural stigma. While substantial research has focused on English-language mental health detection, Arabic remains significantly underexplored, partly due to the scarcity of annotated datasets. We present CARMA, the first automatically annotated large-scale dataset of Arabic Reddit posts. The dataset encompasses six mental health conditions, such as Anxiety, Autism, and Depression, and a control group. CARMA surpasses existing resources in both scale and diversity. We conduct qualitative and quantitative analyses of lexical and semantic differences between users, providing insights into the linguistic markers of specific mental health conditions. To demonstrate the dataset's potential for further mental health analysis, we perform classification experiments using a range of models, from shallow classifiers to large language models. Our results highlight the promise of advancing mental health detection in underrepresented languages such as Arabic.

CARMA: Comprehensive Automatically-annotated Reddit Mental Health Dataset for Arabic

TL;DR

CARMA presents the first large-scale, automatically annotated Arabic mental health dataset derived from Reddit, spanning six conditions and a control group. It combines self-reported diagnosis labeling with rigorous cleaning and dialect filtering to yield a robust resource (~340k diagnosed posts; ~1M control posts) for lexical analysis and model benchmarking. The work demonstrates distinct condition-specific linguistic signals, validates multiple modeling approaches (hybrid pipelines and end-to-end transformer fine-tuning), and establishes strong baselines that advance mental health detection in underrepresented languages. By showcasing scalability, diversity, and actionable linguistic insights, CARMA offers a valuable foundation for early detection, cross-linguistic NLP research, and responsible deployment in Arabic-speaking contexts.

Abstract

Mental health disorders affect millions worldwide, yet early detection remains a major challenge, particularly for Arabic-speaking populations where resources are limited and mental health discourse is often discouraged due to cultural stigma. While substantial research has focused on English-language mental health detection, Arabic remains significantly underexplored, partly due to the scarcity of annotated datasets. We present CARMA, the first automatically annotated large-scale dataset of Arabic Reddit posts. The dataset encompasses six mental health conditions, such as Anxiety, Autism, and Depression, and a control group. CARMA surpasses existing resources in both scale and diversity. We conduct qualitative and quantitative analyses of lexical and semantic differences between users, providing insights into the linguistic markers of specific mental health conditions. To demonstrate the dataset's potential for further mental health analysis, we perform classification experiments using a range of models, from shallow classifiers to large language models. Our results highlight the promise of advancing mental health detection in underrepresented languages such as Arabic.

Paper Structure

This paper contains 20 sections, 3 figures, 7 tables, 1 algorithm.

Figures (3)

  • Figure 1: Key steps in dataset construction, beginning with raw Reddit data and culminating in the final datasets.
  • Figure 2: Descriptive language of different groups in the dataset.
  • Figure 3: Dialects represented in the dataset.