MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models

Hoyun Song, Migyeong Kang, Jisu Shin, Jihyun Kim, Chanbi Park, Hangyeol Yoo, Jihyun An, Alice Oh, Jinyoung Han, KyungTae Lim

TL;DR

The experiments show that while state-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge, they struggle to calibrate confidence in diagnostic decision-making when distinguishing between clinically overlapping disorders, revealing evaluation gaps not captured by existing benchmarks.

Abstract

We introduce MentalBench, a benchmark for evaluating psychiatric diagnostic decision-making in large language models (LLMs). Existing mental health benchmarks largely rely on social media data, limiting their ability to assess DSM-grounded diagnostic judgments. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as a gold-standard logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling low-noise and interpretable evaluation. Our experiments show that while state-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge, they struggle to calibrate confidence in diagnostic decision-making when distinguishing between clinically overlapping disorders. These findings reveal evaluation gaps not captured by existing benchmarks.

Paper Structure (69 sections, 2 equations, 14 figures, 20 tables)

This paper contains 69 sections, 2 equations, 14 figures, 20 tables.

Figures (14)

  • Figure 1: Overview of the knowledge graph schema. The framework models the hierarchical and relational dependencies between disorders, symptom groups, and symptoms, alongside directional differential diagnoses. The resulting graph comprises 23 Disorder nodes, 23 Symptom Group nodes, 84 Symptom nodes, and 87 Differential Diagnosis nodes. These entities are interconnected via mandatory edges, included-in (hierarchical) edges, and differential diagnosis edges, forming a dense network of clinical dependencies.
  • Figure 2: Overview of the clinical case generation framework for constructing MentalBench. Our pipeline consists of three stages: (1) Symptom Profile Construction, where concrete symptom configurations are sampled from MentalKG; (2) Patient Profile Instantiation, which assigns demographic and specific symptom manifestations; and (3) Clinical Case Generation, which synthesizes clinical cases based on patient profiles. These processes are applied under two clinical scenarios: Single-Disease Identification and Differential Diagnosis.
  • Figure 3: Examples of generated clinical cases for differential diagnosis between Major Depressive Disorder (MDD) and Bipolar II Disorder. Ambiguous cases integrate triggering conditions into base profiles, while Unambiguous cases further apply discriminating rules to yield a definitive diagnosis.
  • Figure 4: Heatmap of diagnostic accuracy across 23 mental disorders for each model, for Type 2
  • Figure 5: Confusion mapping between ground truth diagnoses (left) and model predictions (right) for Type 4
  • ...and 9 more figures
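
The schema described in the Figure 1 caption (four node types connected by mandatory, included-in, and differential diagnosis edges) can be illustrated with a small typed graph. This is only a minimal sketch, not the authors' implementation: the class `ToyKG`, the `Node` type, the edge directions, and the node names `illustrative_mood_group` and `illustrative_symptom` are all assumptions made for illustration and are not taken from MentalKG. Only the node and edge type names and the MDD / Bipolar II pairing come from the captions above.

```python
from collections import defaultdict
from dataclasses import dataclass

# Node and edge types named in the Figure 1 caption.
NODE_TYPES = {"Disorder", "SymptomGroup", "Symptom", "DifferentialDiagnosis"}
EDGE_TYPES = {"mandatory", "included_in", "differential"}


@dataclass(frozen=True)
class Node:
    kind: str  # one of NODE_TYPES
    name: str


class ToyKG:
    """Hypothetical toy graph; edge directions are an assumption."""

    def __init__(self):
        self.nodes = set()
        self.edges = defaultdict(list)  # source -> [(edge_type, target)]

    def add_edge(self, src, edge_type, dst):
        if src.kind not in NODE_TYPES or dst.kind not in NODE_TYPES:
            raise ValueError("unknown node type")
        if edge_type not in EDGE_TYPES:
            raise ValueError("unknown edge type")
        self.nodes.update((src, dst))
        self.edges[src].append((edge_type, dst))

    def neighbors(self, src, edge_type):
        # Targets reachable from src over edges of the given type.
        return [dst for et, dst in self.edges[src] if et == edge_type]

    def count(self, kind):
        # Number of distinct nodes of the given type.
        return sum(1 for n in self.nodes if n.kind == kind)


kg = ToyKG()
mdd = Node("Disorder", "Major Depressive Disorder")
bp2 = Node("Disorder", "Bipolar II Disorder")
group = Node("SymptomGroup", "illustrative_mood_group")  # placeholder name
symptom = Node("Symptom", "illustrative_symptom")        # placeholder name
ddx = Node("DifferentialDiagnosis", "MDD -> Bipolar II")

kg.add_edge(symptom, "included_in", group)  # hierarchical edge
kg.add_edge(mdd, "mandatory", group)        # group required for the disorder
kg.add_edge(mdd, "differential", ddx)       # directional differential diagnosis
kg.add_edge(ddx, "differential", bp2)
```

Under these assumptions, `kg.count("Disorder")` tallies nodes by type in the same way the caption reports 23 Disorder nodes for the full graph, and `kg.neighbors(mdd, "differential")` follows the directional differential diagnosis edges out of a disorder.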