LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis
Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng
TL;DR
LingxiDiagBench is an agent-based benchmark for AI-assisted psychiatric diagnosis in Chinese. It pairs real EMR data (LingxiDiag-Clinical) with a large synthetic corpus (LingxiDiag-16K) generated by a multi-agent framework of Patient, Doctor, and Diagnosis Agents across 12 ICD-10 categories. Static and dynamic evaluation paradigms assess both diagnostic accuracy and real-time consultation quality, using strategies that include retrieval-augmented guidance (MRD-RAG) and APA-guided workflows. In experiments with state-of-the-art LLMs, the study finds substantial gaps in comorbidity recognition and multi-class differential diagnosis, and shows that high-quality consultation behavior does not automatically yield correct diagnoses. The framework and data release aim to standardize reproducible research, accelerate development of AI-assisted psychiatric tools, and inform safe, scalable deployment in clinical workflows.
Abstract
Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially for depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.
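The abstract reports accuracy figures for single-label and comorbidity tasks without defining the scoring rules here. As a minimal sketch, assuming single-label tasks are scored by top-1 accuracy and comorbidity recognition by exact match on the full set of diagnoses, the two metrics could look as follows. The function names and the example ICD-10 codes are illustrative assumptions, not taken from the released framework.

```python
from typing import Sequence, Set


def top1_accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of cases whose single predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def exact_match_accuracy(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Fraction of cases whose predicted label set equals the gold set exactly.

    Assumed scoring rule for comorbidity: a prediction that finds only one of
    two co-occurring diagnoses counts as wrong.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(set(g) == set(p) for g, p in zip(gold, pred)) / len(gold)


# Toy labels for illustration only (not benchmark data).
single_gold = ["F32", "F41", "F20"]
single_pred = ["F32", "F32", "F20"]
multi_gold = [{"F32", "F41"}, {"F32"}]
multi_pred = [{"F32"}, {"F32"}]

single_score = top1_accuracy(single_gold, single_pred)
multi_score = exact_match_accuracy(multi_gold, multi_pred)
```

Under an exact-match rule of this kind, a model that identifies depression but misses co-occurring anxiety scores zero on that case, which is one plausible reason comorbidity accuracy could fall well below binary classification accuracy.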
