Leveraging Large Language Models for Suicide Detection on Social Media with Limited Labels

Vy Nguyen, Chau Pham

TL;DR

The paper develops an ensemble that combines prompting Qwen2-72B-Instruct with fine-tuned Llama3-8B, Llama3.1-8B, and Gemma2-9B models, providing a promising solution for identifying suicidal content in social media.

Abstract

The increasing frequency of suicidal thoughts highlights the importance of early detection and intervention. Social media platforms, where users often share personal experiences and seek help, could be utilized to identify individuals at risk. However, the large volume of daily posts makes manual review impractical. This paper explores the use of Large Language Models (LLMs) to automatically detect suicidal content in text-based social media posts. We propose a novel method for generating pseudo-labels for unlabeled data by prompting LLMs, along with traditional classification fine-tuning techniques to enhance label accuracy. To create a strong suicide detection model, we develop an ensemble approach that combines prompting with Qwen2-72B-Instruct and fine-tuned models such as Llama3-8B, Llama3.1-8B, and Gemma2-9B. We evaluate our approach on the dataset of the Suicide Ideation Detection on Social Media Challenge, a track of the IEEE Big Data 2024 Big Data Cup. Additionally, we conduct a comprehensive analysis to assess the impact of different models and fine-tuning strategies on detection performance. Experimental results show that the ensemble model significantly improves detection accuracy, by 5 percentage points compared with the individual models. It achieves a weighted F1 score of 0.770 on the public test set and 0.731 on the private test set, providing a promising solution for identifying suicidal content in social media. Our analysis shows that the choice of LLM affects prompting performance, with larger models providing better accuracy. Our code and checkpoints are publicly available at https://github.com/khanhvynguyen/Suicide_Detection_LLMs.
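The weighted F1 score reported in the abstract can be illustrated with a minimal sketch. It assumes the standard support-weighted definition (per-class F1 averaged with weights proportional to each class's share of the true labels). The function name `weighted_f1` and the toy labels are hypothetical and are not taken from the authors' code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Assumption: this is the standard "weighted" averaging; the
    challenge's exact scoring script is not shown in the source.
    """
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        f1 = 2 * tp / (2 * tp + fp + fn)  # denominator >= n > 0
        score += (n / total) * f1
    return score


# Toy example with illustrative class names.
y_true = ["Ideation", "Ideation", "Behavior", "Attempt"]
y_pred = ["Ideation", "Behavior", "Behavior", "Attempt"]
print(weighted_f1(y_true, y_pred))
```

If scikit-learn is available, `sklearn.metrics.f1_score(y_true, y_pred, average="weighted")` should compute the same quantity.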

Paper Structure (17 sections, 4 equations, 7 figures, 8 tables, 1 algorithm)

Figures (7)

  • Figure 1: An overview of our approach. (a) Pseudo-label generation for unlabeled data. We first use $500$ labeled posts to fine-tune DepRoBERTa \cite{poswiata-perelkiewicz-2022-opi} and Llama3-8B \cite{dubey2024llama} for the classification task. Then, we combine these models with Qwen2-72B-Instruct via prompting to annotate $1{,}500$ posts in the unlabeled dataset. We keep only the $\approx 900$ posts for which all three models predict the same label and combine these with the $500$ labeled posts to form a new training set (Section \ref{subsec:generateing_pseudolabels}). (b) LLM fine-tuning. We then fine-tune Llama3-8B, Llama3.1-8B, and Gemma2-9B on the newly formed dataset with the Macro Double Soft F1 loss (Section \ref{subsec:fine_tune}). (c) Model ensembling. These fine-tuned models are combined with prompting Qwen2-72B-Instruct to create an ensemble model for classifying new user posts (Section \ref{subsec:ensemble}).
  • Figure 2: Prompt template with few-shot examples using Chain-of-Thought prompting. Each example consists of a user post followed by three questions corresponding to the three suicide risk levels: Ideation, Behavior, and Attempt. For each question in the example, a sample response is provided. The final part is used to collect the Yes/No answers from the three responses.
  • Figure 3: Prompt template to assess whether the writer mentions moving on from suicidal thoughts or attempts. First, an instruction is provided to guide the LLM on the task, highlighted in blue. Following this, a set of few-shot exemplars is presented. Each exemplar includes a user post followed by the corresponding answer. When a new post, highlighted in orange, replaces the placeholder, the LLM is expected to generate an answer based on the context of the post, indicating whether the writer mentions moving on.
  • Figure 4: Comparison of F1 scores for our models on the Public Board Test Set containing $\mathbf{100}$ posts. The ensemble model is robust and significantly outperforms the individual models, improving the F1 score on the new test set by approximately $5$ percentage points.
  • Figure 5: Confusion matrices of the models on the 500 original labeled posts. The matrices show that each of the models performs reasonably well. The ensemble (f) outperforms the individual models, with the main improvement coming from the Ideation class.
  • ...and 2 more figures
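
The agreement filter described in Figure 1(a) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each model's predictions are already available as a list of labels aligned with the unlabeled posts, and the names `keep_unanimous`, `deproberta`, `llama3_8b`, and `qwen2_72b` are hypothetical.

```python
def keep_unanimous(posts, predictions):
    """Keep only posts on which every model predicts the same label.

    posts:        list of post texts
    predictions:  dict mapping a model name to a list of labels,
                  one label per post, in the same order as `posts`
    Returns a list of (post, pseudo_label) pairs.
    """
    per_model = list(predictions.values())
    kept = []
    for i, post in enumerate(posts):
        labels = {preds[i] for preds in per_model}
        if len(labels) == 1:  # all models agree on this post
            kept.append((post, labels.pop()))
    return kept


# Toy data; the real pipeline annotates 1,500 unlabeled posts.
unlabeled = ["post A", "post B", "post C"]
predictions = {
    "deproberta": ["Ideation", "Behavior", "Attempt"],
    "llama3_8b": ["Ideation", "Ideation", "Attempt"],
    "qwen2_72b": ["Ideation", "Behavior", "Attempt"],
}
pseudo_labeled = keep_unanimous(unlabeled, predictions)

# The new training set is the labeled posts plus the unanimous ones.
labeled = [("post L", "Behavior")]
train_set = labeled + pseudo_labeled
print(train_set)
```

Requiring unanimous agreement trades coverage for label quality: in the caption's numbers, roughly 900 of the 1,500 annotated posts survive the filter.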