A Comprehensive Evaluation of Large Language Models on Mental Illnesses

Abdelrahman Hanafi; Mohammed Saad; Noureldin Zahran; Radwa J. Hanafy; Mohammed E. Fouda

TL;DR

The paper conducts a large-scale, cross-architecture evaluation of 33 large language models on three mental-health tasks (binary disorder detection, severity estimation, psychiatric knowledge assessment) using six expert-annotated social-media datasets. It emphasizes systematic prompt engineering (BIN, SEV, KNOW templates) and careful output parsing to achieve reproducible results, revealing that newer, instruction-tuned models often outperform larger but older architectures, and that prompt structure can dramatically swing performance. Key findings show GPT-4 dominates binary detection on several datasets, severity benefits from few-shot learning in many cases, and Llama 3.1 405B excels in psychiatric knowledge, albeit with notable safety-related refusals that impact evaluation. The study highlights practical implications for model selection, safety versus accuracy trade-offs, and the critical role of standardized protocols, while acknowledging ethical, data-quality, and scalability limitations that constrain clinical deployment and generalizability.

Abstract

Large Language Models (LLMs) have shown promise in various domains, including healthcare, with significant potential to transform mental health applications by enabling scalable and accessible solutions. This study aims to provide a comprehensive evaluation of 33 LLMs, ranging from 2 billion to 405+ billion parameters, in performing key mental health tasks using social media data across six datasets. To our knowledge, this represents the largest-scale systematic evaluation of modern LLMs for mental health applications. Models such as GPT-4, Llama 3, Claude, Gemma, Gemini, and Phi-3 were assessed for their zero-shot (ZS) and few-shot (FS) capabilities across three tasks: binary disorder detection, disorder severity evaluation, and psychiatric knowledge assessment. Key findings revealed that models like GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, achieving accuracies up to 85% on certain datasets, while FS learning notably enhanced disorder severity evaluations, reducing the Mean Absolute Error (MAE) by 1.3 points for the Phi-3-mini model. Recent models, such as Llama 3.1 405B, demonstrated exceptional psychiatric knowledge assessment accuracy at 91.2%, while prompt engineering played a crucial role in improving performance across tasks. However, the ethical constraints imposed by many LLM providers limit the models' ability to respond to sensitive queries, hampering comprehensive performance evaluations. This work highlights both the capabilities and limitations of LLMs in mental health contexts, offering valuable insights for future applications in psychiatry.
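The severity result above is reported as a reduction in Mean Absolute Error (MAE). As a minimal sketch of how such a comparison can be computed, the snippet below scores zero-shot and few-shot predictions against gold severity labels. The function name `mean_absolute_error` and all label values are hypothetical illustrations and are not taken from the paper's data or code.

```python
def mean_absolute_error(gold, predicted):
    """Average absolute difference between gold and predicted severity labels."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical severity labels on a 0-3 scale (illustrative only).
gold = [0, 1, 2, 3]
zero_shot = [2, 3, 0, 3]   # assumed zero-shot model outputs
few_shot = [0, 1, 1, 3]    # assumed few-shot model outputs

mae_zs = mean_absolute_error(gold, zero_shot)
mae_fs = mean_absolute_error(gold, few_shot)
print(f"ZS MAE = {mae_zs}, FS MAE = {mae_fs}, reduction = {mae_zs - mae_fs}")
```

A lower MAE means the predicted severity levels sit closer to the annotated ones, so a positive reduction indicates that the few-shot prompt helped for that model.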

Paper Structure (65 sections, 1 equation, 6 figures, 15 tables)

Introduction
Related Works
Evaluating LLMs on Mental Health Tasks
Fine-tuning LLMs for mental health tasks
LLMs for Data Augmentation and Chatbot Development
Benchmarks for Evaluating LLMs in Psychiatry
Literature Reviews
Methodology Implementation
Experimental Setup
Task 1: Binary Disorder Detection
Task 2: Disorder Severity Evaluation
Task 3: Psychiatric Knowledge Assessment
Initial Exploration of fine-tuning
Datasets
Dreaddit (Turcan and McKeown, 2019)
...and 50 more sections

Figures (6)

Figure 1: Disorder Severity Evaluation Task Evolution
Figure 2: Results of Prompt BIN-1 on Task 1
Figure 3: Average Deviation of Model Accuracy from the Best Performing Model's Accuracy on each dataset of Task 1 (Dreaddit, SDCNL, DepSeverity, SAD, DEPTWEET, RED SAM) using Prompt BIN-1. Deviation is calculated for each dataset as the absolute difference in accuracy between a given model and the best performing model on that specific dataset. These deviations are then averaged across all datasets to produce the final values shown.
Figure 4: Invalid Response Statistics Sorted by model release date
Figure 5: Results of all models on Task 3 on the MedMCQA dataset
...and 1 more figure
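The Figure 3 caption describes a two-step computation: per dataset, take the absolute accuracy gap between each model and the best model on that dataset, then average those gaps across datasets. A minimal sketch of that calculation follows. The function name `average_deviation_from_best`, the model names, and the accuracy values are hypothetical and are used only for illustration.

```python
def average_deviation_from_best(accuracy):
    """Map each model to its mean absolute accuracy gap from the per-dataset best.

    `accuracy` maps model name -> {dataset name -> accuracy in percent}.
    Every model is assumed to have been evaluated on the same datasets.
    """
    models = list(accuracy)
    datasets = list(accuracy[models[0]])
    for model in models:
        if set(accuracy[model]) != set(datasets):
            raise ValueError(f"{model} is missing or has extra datasets")

    # Best accuracy achieved by any model on each dataset.
    best = {d: max(accuracy[m][d] for m in models) for d in datasets}

    # Average the per-dataset gaps for each model.
    return {
        m: sum(abs(best[d] - accuracy[m][d]) for d in datasets) / len(datasets)
        for m in models
    }


# Hypothetical accuracies in percent (not results from the paper).
scores = {
    "model_a": {"Dreaddit": 80.0, "SDCNL": 70.0},
    "model_b": {"Dreaddit": 75.0, "SDCNL": 60.0},
}
deviations = average_deviation_from_best(scores)
print(deviations)
```

Under this definition a model that is best on every dataset gets a deviation of zero, and larger values indicate a model that trails the per-dataset leader more on average.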
