Evaluation of ChatGPT for NLP-based Mental Health Applications
Bishal Lamichhane
TL;DR
This paper evaluates ChatGPT's zero-shot classification on three NLP-based mental health tasks (stress, depression, and suicidality detection) using public social-media datasets. It uses a simple class-prediction prompt with the gpt-3.5-turbo backend and reports F1 and confusion metrics, comparing against a dominant-class baseline. ChatGPT achieves F1 scores of 0.73 for stress, 0.86 for depression, and 0.37 for suicidality, with notably lower performance and higher confusion in the five-class suicidality task. The findings indicate that LLMs can serve as backends for mental health NLP tasks, but that clinical reliability would require further tuning, evaluation with stronger backends (e.g., GPT-4), and careful consideration of dataset and annotation limitations.
Abstract
Large language models (LLMs) have been successful in several natural language understanding tasks and could be relevant for natural language processing (NLP)-based mental health application research. In this work, we report the performance of the LLM-based ChatGPT (with the gpt-3.5-turbo backend) on three text-based mental health classification tasks: stress detection (2-class classification), depression detection (2-class classification), and suicidality detection (5-class classification). We obtained annotated social media posts for the three classification tasks from public datasets. We then used the ChatGPT API to classify each post, supplying an input prompt that asked for a class prediction. We obtained F1 scores of 0.73, 0.86, and 0.37 for stress detection, depression detection, and suicidality detection, respectively. A baseline model that always predicted the dominant class resulted in F1 scores of 0.35, 0.60, and 0.19 on the same three tasks. The zero-shot classification performance obtained with ChatGPT indicates a potential use of language models for mental health classification tasks.
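To make the dominant-class baseline concrete, the following is a minimal sketch of how such a baseline can be scored with F1. The abstract does not state which F1 averaging was used, so the support-weighted average here is an assumption, and the function names and toy labels are illustrative only; they are not taken from the paper or its datasets.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged by class support (assumed averaging scheme)."""
    support = Counter(y_true)
    n = len(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * k / n for c, k in support.items())


def dominant_class_predictions(y_true):
    """Predict the most frequent label for every example."""
    dominant, _ = Counter(y_true).most_common(1)[0]
    return [dominant] * len(y_true)


# Toy labels for illustration only (not the paper's data).
y_true = ["stress"] * 6 + ["no_stress"] * 4
baseline = dominant_class_predictions(y_true)
print(round(weighted_f1(y_true, baseline), 3))
```

Because the baseline never predicts the minority classes, their per-class F1 is zero, which is why a baseline of this kind scores lowest on the task with the most classes.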
