PsychiatryBench: A Multi-Task Benchmark for LLMs in Psychiatry

Aya E. Fouda, Abdelrahamn A. Hassan, Radwa J. Hanafy, Mohammed E. Fouda

TL;DR

PsychiatryBench is an expert-grounded benchmark for evaluating LLMs in psychiatry, built exclusively from authoritative textbooks and casebooks to assess diagnostic reasoning, treatment planning, and longitudinal clinical decision-making. It spans 11 task types with 5,188 expert-annotated items and uses both conventional metrics and an LLM-as-judge similarity framework to compare 15 models, including frontier generalists and medical-specialized systems. The study finds that recent architectures improve markedly, yet challenges persist in fine-grained classification, extended matching item (EMI) reasoning, and task-format sensitivity, underscoring the need for improvements in safety, cultural validity, and methodology. PsychiatryBench offers a modular, extensible platform to benchmark and improve psychiatric reasoning in LLMs, aiming to bridge current gaps between AI capabilities and real-world clinical practice.

Abstract

Large language models (LLMs) offer significant potential in enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources rely heavily on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of diagnostic reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats, totaling 5,188 expert-annotated items. We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, Sonnet 4.5, and GPT 5) alongside leading open-source medical models such as MedGemma using both conventional metrics and an "LLM-as-judge" similarity scoring framework. Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in mental health applications.

Paper Structure

This paper contains 71 sections, 14 equations, 25 figures, 10 tables.

Figures (25)

  • Figure 1: Visual representation of the PsychiatryBench dataset development pipeline and its final evaluation subset.
  • Figure 1: Diagnosis Prediction Prompt
  • Figure 2: Workflow for the PsychiatryBench study: manual extraction and processing of clinical QA samples followed by LLM evaluation across eleven task types.
  • Figure 2: Diagnosis Evaluation Prompt
  • Figure 3: Each bubble shows a model, positioned by release date and average performance. Bubble size denotes parameter count, and color represents model family. Larger, newer models generally achieve higher performance.
  • ...and 20 more figures