Su-RoBERTa: A Semi-supervised Approach to Predicting Suicide Risk through Social Media using Base Language Models

Chayan Tank, Shaina Mehta, Sarthak Pol, Vinayak Katoch, Avinash Anand, Raj Jaiswal, Rajiv Ratn Shah

TL;DR

Su-RoBERTa demonstrates that a fine-tuned RoBERTa model with fewer than 500M parameters, trained semi-supervised with data augmentation via GPT-2 and NLPAug with RoBERTa, can predict suicide risk from Reddit data, reaching a final weighted F1 of 69.84% on the IEEE BigData 2024 task and ranking 10th. The approach addresses class imbalance through two augmentation streams and a two-iteration pseudo-labeling regime, highlighting the practicality and efficiency of base language models compared with larger LLMs. The study compares against a GPT-4-labeled baseline and a classical SVC, illustrating the trade-offs between accuracy and compute for real-world mental-health monitoring. Looking forward, the authors propose multi-modal data integration and enhanced explainability to support clinical validation and deployment on mobile and edge devices.

Abstract

In recent times, more and more people are posting about their mental states across various social media platforms. Leveraging this data, AI-based systems can be developed to help assess aspects of an individual's mental health, such as suicide risk. This paper studies suicide risk assessment on Reddit data, using base language models to identify patterns in social media posts. We demonstrate that smaller language models, i.e., those with fewer than 500M parameters, can also be effective, in contrast to LLMs with more than 500M parameters. We propose Su-RoBERTa, a RoBERTa model fine-tuned for the suicide risk prediction task that uses both labeled and unlabeled Reddit data and tackles class imbalance through data augmentation with the GPT-2 model. Our Su-RoBERTa model attained a 69.84% weighted F1 score in the final evaluation. This paper demonstrates the effectiveness of base language models for analyzing risk factors related to mental health with an efficient computation pipeline.
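The weighted F1 score reported in the abstract averages per-class F1 values, weighting each class by its share of the true labels, which matters when classes are imbalanced. The following is a minimal pure-Python sketch of that metric, not the authors' evaluation code; the function name `weighted_f1` and the toy labels "a", "b", "c" are placeholders chosen for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight by class frequency
    return score


# Toy labels (placeholders, not the task's actual classes).
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.8333
```

In the toy example, classes "a" and "b" each have F1 of 0.8 and class "c" has F1 of 1.0, so the weighted score is 3/6 * 0.8 + 2/6 * 0.8 + 1/6 * 1.0 = 5/6.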

Paper Structure

This paper contains 13 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The proposed base language model approach for Su-RoBERTa: Stage 1: Data augmentation of labeled samples using the GPT-2 model. Stage 2: Proposed semi-supervised learning pipeline for Su-RoBERTa. The fire symbol indicates that the model is being fine-tuned.
  • Figure 2: Su-RoBERTa semi-supervised fine-tuning pipeline mentioned in Stage 2 of Figure 1
  • Figure 3: Original Sample distribution of the dataset
  • Figure 4: Augmented Sample Distribution of the dataset
  • Figure 5: The classical semi-supervised pipeline: Stage 1: Data augmentation of labeled samples using NLPAug + RoBERTa model. Stage 2: Extracting embeddings of both labeled and unlabeled samples with a Sentence Transformer, then vertically stacking them for further processing. Stage 3: Classical semi-supervised learning pipeline using SVMs. The fire symbol indicates that the classifier is being trained.
  • ...and 2 more figures
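The semi-supervised stage named in the Figure 5 caption, together with the two-iteration pseudo-labeling regime mentioned in the TL;DR, can be illustrated with a small self-training loop. This is a sketch under stated assumptions, not the authors' pipeline: a nearest-centroid classifier stands in for the SVM, short numeric vectors stand in for sentence embeddings, and the margin-based acceptance rule (`min_margin`) is a hypothetical selection criterion that the source does not specify.

```python
import math


def fit_centroids(X, y):
    """Return {label: mean vector} over the labeled points."""
    sums, counts = {}, {}
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict_with_margin(centroids, vec):
    """Return (nearest label, gap between second-nearest and nearest distance)."""
    dists = sorted((math.dist(vec, c), lab) for lab, c in centroids.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def self_train(X_lab, y_lab, X_unlab, rounds=2, min_margin=1.0):
    """Pseudo-label confident unlabeled points, then refit, for `rounds` rounds.

    Assumption: a point is "confident" when its margin is at least min_margin.
    """
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(rounds):
        centroids = fit_centroids(X_lab, y_lab)
        remaining = []
        for vec in pool:
            label, margin = predict_with_margin(centroids, vec)
            if margin >= min_margin:
                X_lab.append(vec)      # accept the pseudo-label
                y_lab.append(label)
            else:
                remaining.append(vec)  # stay unlabeled for the next round
        pool = remaining
    return fit_centroids(X_lab, y_lab), y_lab, pool


# Toy 2-D "embeddings"; the point [5, 5] is equidistant and stays unlabeled.
centroids, labels, leftover = self_train(
    [[0, 0], [10, 10]], ["a", "b"], [[1, 1], [9, 9], [5, 5]]
)
print(labels)    # ['a', 'b', 'a', 'b']
print(leftover)  # [[5, 5]]
```

Keeping ambiguous points out of the training set limits how much label noise each round can feed back into the classifier, which is the usual motivation for a confidence filter in self-training.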