A Comprehensive Evaluation of Large Language Models on Mental Illnesses
Abdelrahman Hanafi, Mohammed Saad, Noureldin Zahran, Radwa J. Hanafy, Mohammed E. Fouda
TL;DR
The paper conducts a large-scale, cross-architecture evaluation of 33 large language models on three mental-health tasks (binary disorder detection, severity estimation, psychiatric knowledge assessment) using six expert-annotated social-media datasets. It emphasizes systematic prompt engineering (BIN, SEV, KNOW templates) and careful output parsing to achieve reproducible results, revealing that newer, instruction-tuned models often outperform larger but older architectures, and that prompt structure can dramatically swing performance. Key findings show GPT-4 dominates binary detection on several datasets, severity benefits from few-shot learning in many cases, and Llama 3.1 405B excels in psychiatric knowledge, albeit with notable safety-related refusals that impact evaluation. The study highlights practical implications for model selection, safety versus accuracy trade-offs, and the critical role of standardized protocols, while acknowledging ethical, data-quality, and scalability limitations that constrain clinical deployment and generalizability.
Abstract
Large Language Models (LLMs) have shown promise in various domains, including healthcare, with significant potential to transform mental health applications by enabling scalable and accessible solutions. This study provides a comprehensive evaluation of 33 LLMs, ranging from 2 billion to 405+ billion parameters, on key mental health tasks using social media data across six datasets. To our knowledge, this is the largest-scale systematic evaluation of modern LLMs for mental health applications. Models such as GPT-4, Llama 3, Claude, Gemma, Gemini, and Phi-3 were assessed for their zero-shot (ZS) and few-shot (FS) capabilities across three tasks: binary disorder detection, disorder severity evaluation, and psychiatric knowledge assessment. Models such as GPT-4 and Llama 3 performed best in binary disorder detection, reaching accuracies of up to 85% on certain datasets, while FS learning notably improved disorder severity evaluation, reducing the Mean Absolute Error (MAE) by 1.3 points for the Phi-3-mini model. Recent models performed strongly on psychiatric knowledge assessment, with Llama 3.1 405B reaching 91.2% accuracy, and prompt engineering played a crucial role in improving performance across tasks. However, the ethical constraints imposed by many LLM providers limit the models' ability to respond to sensitive queries, which hampers comprehensive performance evaluation. This work highlights both the capabilities and limitations of LLMs in mental health contexts, offering valuable insights for future applications in psychiatry.
