PsihoRo: Depression and Anxiety Romanian Text Corpus
Alexandra Ciobotaru, Ana-Maria Bucur, Liviu P. Dinu
TL;DR
PsihoRo addresses the lack of Romanian mental-health NLP resources by presenting the first open Romanian corpus for depression and anxiety, assembled from 205 anonymous responses to a three-part survey (six open-ended questions plus the PHQ-9 and GAD-7 questionnaires). The authors apply Ro-LIWC, a Romanian BERT-based emotion detector, and STM-based topic modeling to extract linguistic, affective, and thematic patterns, and use LightGBM with SHAP to link LIWC features to risk labels. The findings show robust correlations between LIWC categories and symptom scores, identifiable emotion patterns corresponding to depression and anxiety, and group-specific topical themes, although predicting symptom scores from text by regression remains challenging because of the small sample size. PsihoRo thus offers a resource for cross-linguistic mental-health NLP and sets the stage for larger, more nuanced Romanian analyses and interventions.
Abstract
Psychological corpora in NLP are collections of texts used to analyze human psychology, emotions, and mental health. These texts allow researchers to study psychological constructs, detect mental health issues, and analyze emotional language. However, mental health data are difficult to collect reliably from social media because of the assumptions made by the collectors. A more pragmatic strategy is to gather texts through open-ended questions and then assess the respondents with self-report screening questionnaires. This method has been employed successfully for English, a language with many psychological NLP resources. The same cannot be said for Romanian, which currently has no open-source mental health corpus. To address this gap, we created the first corpus for depression and anxiety in Romanian, using a form with 6 open-ended questions together with the standardized PHQ-9 and GAD-7 screening questionnaires. Although a corpus of texts from 205 respondents may seem small, PsihoRo is a first step towards understanding and analyzing texts about the mental health of the Romanian population. We employ statistical analysis, text analysis using Romanian LIWC, emotion detection, and topic modeling to show the most important features of this newly introduced resource to the NLP community.
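To make the screening step concrete, the following is a minimal Python sketch of how PHQ-9 and GAD-7 answers are conventionally scored: each item is rated 0 to 3, items are summed, and the total is mapped to a standard severity band. The function names, the dictionary layout, and the cutoff of 10 used for the binary flag are assumptions made for illustration; the abstract does not state which thresholds define the corpus labels.

```python
from typing import Dict, List, Sequence, Tuple

# Conventional severity bands as (upper bound, label) pairs.
# PHQ-9 totals range from 0 to 27, GAD-7 totals from 0 to 21.
PHQ9_BANDS: List[Tuple[int, str]] = [
    (4, "minimal"), (9, "mild"), (14, "moderate"),
    (19, "moderately severe"), (27, "severe"),
]
GAD7_BANDS: List[Tuple[int, str]] = [
    (4, "minimal"), (9, "mild"), (14, "moderate"), (21, "severe"),
]


def total_score(items: Sequence[int], n_items: int) -> int:
    """Sum the item ratings after checking count and 0-3 range."""
    if len(items) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(items)}")
    if any(v not in (0, 1, 2, 3) for v in items):
        raise ValueError("each item must be rated 0, 1, 2, or 3")
    return sum(items)


def severity(score: int, bands: List[Tuple[int, str]]) -> str:
    """Return the label of the first band whose upper bound covers the score."""
    for upper, label in bands:
        if score <= upper:
            return label
    raise ValueError(f"score {score} is outside the questionnaire range")


def screen(phq9_items: Sequence[int], gad7_items: Sequence[int],
           cutoff: int = 10) -> Dict[str, object]:
    """Score one respondent. `cutoff` is an assumed threshold, not the paper's."""
    phq9 = total_score(phq9_items, 9)
    gad7 = total_score(gad7_items, 7)
    return {
        "phq9_score": phq9,
        "phq9_severity": severity(phq9, PHQ9_BANDS),
        "phq9_flag": phq9 >= cutoff,
        "gad7_score": gad7,
        "gad7_severity": severity(gad7, GAD7_BANDS),
        "gad7_flag": gad7 >= cutoff,
    }
```

In a corpus of this kind, each respondent's scores would be stored next to their answers to the open-ended questions, so that text features can be correlated with the symptom totals.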
