A Comprehensive Evaluation of Large Language Models on Mental Illnesses

Abdelrahman Hanafi, Mohammed Saad, Noureldin Zahran, Radwa J. Hanafy, Mohammed E. Fouda

TL;DR

The paper conducts a large-scale, cross-architecture evaluation of 33 large language models on three mental-health tasks (binary disorder detection, severity estimation, psychiatric knowledge assessment) using six expert-annotated social-media datasets. It emphasizes systematic prompt engineering (BIN, SEV, KNOW templates) and careful output parsing to achieve reproducible results, revealing that newer, instruction-tuned models often outperform larger but older architectures, and that prompt structure can dramatically swing performance. Key findings show GPT-4 dominates binary detection on several datasets, severity benefits from few-shot learning in many cases, and Llama 3.1 405B excels in psychiatric knowledge, albeit with notable safety-related refusals that impact evaluation. The study highlights practical implications for model selection, safety versus accuracy trade-offs, and the critical role of standardized protocols, while acknowledging ethical, data-quality, and scalability limitations that constrain clinical deployment and generalizability.

Abstract

Large Language Models (LLMs) have shown promise in various domains, including healthcare, with significant potential to transform mental health applications by enabling scalable and accessible solutions. This study aims to provide a comprehensive evaluation of 33 LLMs, ranging from 2 billion to 405+ billion parameters, in performing key mental health tasks using social media data across six datasets. To our knowledge, this represents the largest-scale systematic evaluation of modern LLMs for mental health applications. Models such as GPT-4, Llama 3, Claude, Gemma, Gemini, and Phi-3 were assessed for their zero-shot (ZS) and few-shot (FS) capabilities across three tasks: binary disorder detection, disorder severity evaluation, and psychiatric knowledge assessment. Key findings revealed that models like GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, achieving accuracies up to 85% on certain datasets, while FS learning notably enhanced disorder severity evaluations, reducing the Mean Absolute Error (MAE) by 1.3 points for the Phi-3-mini model. Recent models, such as Llama 3.1 405b, demonstrated exceptional psychiatric knowledge assessment accuracy at 91.2%, while prompt engineering played a crucial role in improving performance across tasks. However, the ethical constraints imposed by many LLM providers limit their ability to respond to sensitive queries, hampering comprehensive performance evaluations. This work highlights both the capabilities and limitations of LLMs in mental health contexts, offering valuable insights for future applications in psychiatry.
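To make the severity metric concrete, the following is a minimal sketch of how Mean Absolute Error (MAE) compares zero-shot and few-shot predictions. The label scale, the example values, and the function name `mean_absolute_error` are illustrative assumptions, not the paper's data or code.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between gold and predicted severity labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical severity labels on an assumed 0-3 scale (not from the paper).
gold = [0, 1, 2, 3]
zero_shot_pred = [2, 3, 2, 1]  # assumed zero-shot model outputs
few_shot_pred = [0, 2, 2, 3]   # assumed few-shot model outputs

mae_zs = mean_absolute_error(gold, zero_shot_pred)
mae_fs = mean_absolute_error(gold, few_shot_pred)

# A positive value means few-shot prompting lowered the error.
mae_reduction = mae_zs - mae_fs
```

A reduction "by 1.3 points" in the abstract is a difference of this kind: the zero-shot MAE minus the few-shot MAE, measured on the same severity scale.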

Paper Structure

This paper contains 65 sections, 1 equation, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Disorder Severity Evaluation Task Evolution
  • Figure 2: Results of Prompt BIN-1 on Task 1
  • Figure 3: Average Deviation of Model Accuracy from the Best Performing Model's Accuracy on each dataset of Task 1 (Dreaddit, SDCNL, DepSeverity, SAD, DEPTWEET, RED SAM) using Prompt BIN-1. Deviation is calculated for each dataset as the absolute difference in accuracy between a given model and the best performing model on that specific dataset. These deviations are then averaged across all datasets to produce the final values shown.
  • Figure 4: Invalid Response Statistics Sorted by model release date
  • Figure 5: Results of all models on Task 3 on the MedMCQA dataset
  • ...and 1 more figures
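
The Figure 3 caption describes a two-step aggregation: a per-dataset absolute deviation from the best model, then an average across datasets. A minimal sketch of that calculation follows. The function name `average_deviation_from_best`, the nested-dictionary layout, and all numbers are assumptions made for illustration; they are not the paper's results.

```python
from statistics import mean


def average_deviation_from_best(accuracy):
    """Map each model to its mean absolute accuracy gap from the best model.

    `accuracy` is assumed to be {dataset: {model: accuracy_in_percent}}.
    """
    deviations = {}
    for scores in accuracy.values():
        best = max(scores.values())  # best accuracy on this dataset
        for model, acc in scores.items():
            deviations.setdefault(model, []).append(abs(best - acc))
    # Average the per-dataset deviations for every model.
    return {model: mean(devs) for model, devs in deviations.items()}


# Hypothetical accuracies in percent (placeholders, not reported values).
toy_accuracy = {
    "dataset_a": {"model_x": 80.0, "model_y": 70.0},
    "dataset_b": {"model_x": 60.0, "model_y": 75.0},
}

avg_dev = average_deviation_from_best(toy_accuracy)
```

Under this definition a model that is best on every dataset scores 0, and lower values indicate a model that stays close to the top performer across datasets even if it never ranks first.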