Your Model Is Not Predicting Depression Well And That Is Why: A Case Study of PRIMATE Dataset
Kirill Milintsevich, Kairit Sirts, Gaël Dias
TL;DR
This paper investigates annotation quality in PHQ-9-based depression labels within the PRIMATE dataset and shows that crowd-sourced annotations can yield false positives for anhedonia. By having a mental health professional reannotate 170 posts and adding fine-grained labels with textual spans, the authors demonstrate substantial label invalidity and provide a higher-quality test set for anhedonia detection. They evaluate several pre-trained language models on PRIMATE and show that model performance alone does not compensate for poor annotation quality, underscoring the need for domain-expert annotation pipelines and standardized quality control. The refined annotations are released under a Data Use Agreement to support more reliable NLP-based mental health assessment benchmarks.
Abstract
This paper addresses the quality of annotations in mental health datasets used for NLP-based depression level estimation from social media texts. While previous research relies on social media-based datasets annotated with binary categories, i.e., depressed or non-depressed, recent datasets such as D2S and PRIMATE aim for more nuanced annotations based on PHQ-9 symptoms. However, most of these datasets rely on crowd workers who lack the domain knowledge needed for annotation. Focusing on the PRIMATE dataset, our study reveals concerns regarding annotation validity, particularly for the "lack of interest or pleasure" symptom (anhedonia). Through reannotation by a mental health professional, we introduce finer-grained labels and textual spans as evidence, identifying a notable number of false positives. Our refined annotations, to be released under a Data Use Agreement, offer a higher-quality test set for anhedonia detection. This study underscores the necessity of addressing annotation quality issues in mental health datasets and advocates for improved annotation methodologies to enhance the reliability of NLP models in mental health assessments.
