Evaluation of ChatGPT for NLP-based Mental Health Applications

Bishal Lamichhane

TL;DR

This paper evaluates ChatGPT's zero-shot classification on three NLP-based mental health tasks (stress, depression, and suicidality detection) using public social-media datasets. It uses a simple class-prediction prompt with the gpt-3.5-turbo backend and reports F1 scores and confusion matrices, comparing against a baseline that always predicts the dominant class. ChatGPT achieves F1 scores of 0.73 for stress, 0.86 for depression, and 0.37 for suicidality (baseline: 0.35, 0.60, and 0.19), with notably lower performance and more confusion between classes in the five-class suicidality task. The findings indicate that LLMs can serve as backends for mental health NLP tasks but require further tuning, evaluation with stronger backends (e.g., GPT-4), and careful consideration of dataset and annotation limitations before they can be considered clinically reliable.

Abstract

Large language models (LLMs) have been successful in several natural language understanding tasks and could be relevant for natural language processing (NLP)-based mental health application research. In this work, we report the performance of the LLM-based ChatGPT (with the gpt-3.5-turbo backend) on three text-based mental health classification tasks: stress detection (2-class classification), depression detection (2-class classification), and suicidality detection (5-class classification). We obtained annotated social media posts for the three classification tasks from public datasets and used the ChatGPT API to classify each post with a classification prompt. We obtained F1 scores of 0.73, 0.86, and 0.37 for stress detection, depression detection, and suicidality detection, respectively. A baseline model that always predicted the dominant class resulted in F1 scores of 0.35, 0.60, and 0.19. The zero-shot classification performance obtained with ChatGPT indicates a potential use of language models for mental health classification tasks.
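
To make the comparison against the dominant-class baseline concrete, the following is a minimal sketch of how such a baseline and its F1 score could be computed. The abstract does not say which F1 averaging scheme was used, so the sketch shows both macro and support-weighted averages as assumptions; the function names and the toy labels are hypothetical and are not taken from the paper or its datasets.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Per-class F1 weighted by class support (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * n for label, n in support.items()) / len(y_true)


def dominant_class_baseline(y_true):
    """Predict the most frequent true label for every example."""
    dominant = Counter(y_true).most_common(1)[0][0]
    return [dominant] * len(y_true)


# Toy labels for illustration only (not data from the paper).
y_true = ["stress"] * 6 + ["no_stress"] * 4
y_pred = dominant_class_baseline(y_true)
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Because the baseline never predicts the minority classes, their F1 is zero, which is why a classifier has to beat this score on imbalanced data to be considered informative.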

Paper Structure (9 sections, 3 figures, 1 table)

This paper contains 9 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Confusion matrix for the prediction from ChatGPT on the stress detection task.
  • Figure 2: Confusion matrix for the prediction from ChatGPT on the depression detection task.
  • Figure 3: Confusion matrix for the prediction from ChatGPT on the 5-class suicidality detection task. When ChatGPT could not assign any of the five classes to the input text, we labeled it as belonging to the None class.
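
The None class described for Figure 3 implies a post-processing step that maps each free-text model response to one of the task's classes, or to None when no class can be assigned. A minimal sketch of one way to do this is given below. The matching rule (exactly one class name mentioned, case-insensitive), the function names, and the placeholder class names are all assumptions made for illustration; they are not the paper's actual procedure or the dataset's actual labels.

```python
import re
from collections import Counter

NONE_LABEL = "None"

# Placeholder class names (hypothetical, not the dataset's real labels).
CLASSES = ["level1", "level2", "level3", "level4", "level5"]


def parse_label(response, classes):
    """Map a free-text response to a class name, or NONE_LABEL if unclear.

    Assumed rule: a response counts only if it mentions exactly one class.
    """
    text = response.lower()
    hits = [
        c for c in classes
        if re.search(r"\b" + re.escape(c.lower()) + r"\b", text)
    ]
    return hits[0] if len(hits) == 1 else NONE_LABEL


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs, including the None column."""
    return Counter(zip(y_true, y_pred))


# Toy responses for illustration only (not outputs from the paper).
responses = ["This post is Level2.", "I cannot classify this text."]
y_true = ["level2", "level4"]
y_pred = [parse_label(r, CLASSES) for r in responses]
print(confusion_counts(y_true, y_pred))
```

Keeping unassignable responses in their own column, rather than dropping them, means every test post still appears in the confusion matrix.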