Data Quality Matters: Suicide Intention Detection on Social Media Posts Using RoBERTa-CNN
Emily Lin, Jian Sun, Hsingyu Chen, Mohammad H. Mahoor
TL;DR
The paper addresses suicide intention detection (SID) in social media posts using a RoBERTa-CNN architecture, which combines a RoBERTa backbone with a CNN head to capture both contextual and local linguistic features. On the Suicide and Depression Detection (SDD) dataset, the model achieves about 98% mean accuracy with a small standard deviation, outperforming several baselines. A key finding is that data quality significantly boosts performance: noisy text was cleaned either manually or with the OpenAI API. The work suggests that high-quality data, coupled with a hybrid architecture such as RoBERTa-CNN, can improve automatic SID with potential for real-world intervention, while noting computational costs and proposing future multi-modal extensions.
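To make the "CNN head on contextual embeddings" idea concrete, the following is a minimal toy sketch in pure Python, not the authors' implementation. It assumes the head is a standard text-CNN (1D convolution over consecutive tokens, ReLU, max-over-time pooling, then a sigmoid output). Random vectors stand in for RoBERTa's per-token hidden states, and every name and size here (`conv1d_max_pool`, `classify`, `hidden_size`, and so on) is hypothetical.

```python
import math
import random


def conv1d_max_pool(token_vectors, kernels, biases):
    """Slide each kernel over consecutive tokens, apply ReLU, max-pool over time."""
    pooled = []
    for kernel, bias in zip(kernels, biases):
        width = len(kernel)  # number of consecutive tokens the kernel spans
        best = 0.0  # ReLU outputs are never below zero
        for start in range(len(token_vectors) - width + 1):
            window = token_vectors[start:start + width]
            activation = bias + sum(
                w * x
                for kernel_row, token_row in zip(kernel, window)
                for w, x in zip(kernel_row, token_row)
            )
            best = max(best, max(0.0, activation))
        pooled.append(best)
    return pooled


def classify(token_vectors, kernels, biases, out_weights, out_bias):
    """Map pooled convolution features to a probability with a sigmoid."""
    features = conv1d_max_pool(token_vectors, kernels, biases)
    logit = out_bias + sum(w * f for w, f in zip(out_weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


# Assumption: random vectors stand in for RoBERTa's per-token hidden states.
rng = random.Random(0)
hidden_size, seq_len, kernel_width, num_filters = 8, 12, 3, 4
tokens = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(seq_len)]
kernels = [
    [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(kernel_width)]
    for _ in range(num_filters)
]
biases = [0.0] * num_filters
out_weights = [rng.uniform(-0.5, 0.5) for _ in range(num_filters)]

prob = classify(tokens, kernels, biases, out_weights, 0.0)
print(f"toy probability: {prob:.3f}")
```

Each kernel responds to a short window of adjacent tokens, which is the sense in which a convolutional head adds local pattern detection on top of context-aware embeddings. In practice the weights would be learned jointly with the backbone rather than drawn at random.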
Abstract
Suicide remains a pressing global health concern, necessitating innovative approaches for early detection and intervention. This paper focuses on identifying suicidal intentions in posts from the SuicideWatch subreddit by proposing a novel deep-learning approach that utilizes the state-of-the-art RoBERTa-CNN model. The Robustly Optimized BERT Pretraining Approach (RoBERTa) excels at capturing textual nuances and forming semantic relationships within the text. The Convolutional Neural Network (CNN) head enhances RoBERTa's capacity to discern critical patterns from extensive datasets. To evaluate RoBERTa-CNN, we conducted experiments on the Suicide and Depression Detection dataset, yielding promising results. For instance, RoBERTa-CNN achieves a mean accuracy of 98% with a standard deviation (STD) of 0.0009. Additionally, we found that data quality significantly impacts the training of a robust model. To improve data quality, we removed noise from the text data while preserving its contextual content, either by manual cleaning or by using the OpenAI API.
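The abstract does not list the specific cleaning rules, so the sketch below is only an illustration of what rule-based noise removal that preserves the wording of a post might look like. It assumes "noise" means HTML tags and entities, URLs, exaggerated character repetition, and irregular whitespace. The function name `clean_post` and the regular expressions are hypothetical, not the authors' pipeline.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                  # leftover HTML tags such as <br>
URL_RE = re.compile(r"https?://\S+|www\.\S+")    # bare links
REPEAT_RE = re.compile(r"([A-Za-z!?.])\1{3,}")   # 4+ repeats of a letter or mark
SPACE_RE = re.compile(r"\s+")                    # runs of whitespace


def clean_post(text: str) -> str:
    """Remove markup-style noise while leaving the post's words intact."""
    text = TAG_RE.sub(" ", text)           # strip tags before decoding entities
    text = html.unescape(text)             # "&amp;" -> "&"
    text = URL_RE.sub(" ", text)           # drop links, keep surrounding words
    text = REPEAT_RE.sub(r"\1\1\1", text)  # "soooooo" -> "sooo"
    return SPACE_RE.sub(" ", text).strip()


print(clean_post("I feel   soooooo tired &amp; alone http://example.com/x <br> ok"))
```

Tags are stripped before entities are decoded so that an escaped character such as `&lt;` is not later mistaken for the start of a tag. Digits are deliberately excluded from the repetition rule so that numbers in a post are left unchanged.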
