MDD-5k: A New Diagnostic Conversation Dataset for Mental Disorders Synthesized via Neuro-Symbolic LLM Agents

Congchi Yin; Feng Li; Shu Zhang; Zike Wang; Jun Shao; Piji Li; Jianhua Chen; Xun Jiang

TL;DR

This work addresses the privacy barriers to collecting diagnostic conversations for mental disorders by proposing a neuro-symbolic multi-agent framework that synthesizes diagnostic dialogues from anonymized patient cases. Guided by a fixed SCID-5-based symptom tree and a dynamic experience inquiry tree, the system generates diverse, long, professionally labeled Chinese diagnostic conversations. The result is MDD-5k, the largest open Chinese mental disorder diagnosis dataset, with 5000 dialogues and clinician labels derived from 1000 cases. Human evaluation indicates high professionalism and realism relative to baselines, which suggests the framework could support downstream tasks such as disorder classification and diagnostic support, although limitations in realism and coverage remain. The dataset and methodology advance privacy-preserving data synthesis for mental health AI and lay the groundwork for broader disorder coverage and multilingual release.

Abstract

The clinical diagnosis of most mental disorders relies primarily on conversations between a psychiatrist and a patient. Datasets of such diagnostic conversations could substantially advance the AI mental healthcare community. However, collecting conversations directly from real diagnostic sessions is nearly impossible because of stringent privacy and ethical constraints. To address this issue, we synthesize diagnostic conversations from anonymized patient cases, which are easier to access. Specifically, we design a neuro-symbolic multi-agent framework that uses large language models to synthesize diagnostic conversations for mental disorders. It takes a patient case as input and can generate multiple diverse conversations from a single case. The framework centers on the interaction between a doctor agent and a patient agent, and it generates conversations under symbolic control via a dynamic diagnosis tree. Applying the proposed framework, we develop MDD-5k, the largest Chinese mental disorder diagnosis dataset. Built in cooperation with Shanghai Mental Health Center on 1000 real, anonymized patient cases, it comprises 5000 high-quality long conversations labeled with diagnosis results and treatment opinions. To the best of our knowledge, it is also the first labeled dataset for Chinese mental disorder diagnosis. Human evaluation demonstrates that MDD-5k successfully simulates a human-like diagnostic process for mental disorders.
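To make the idea of "symbolic control via a diagnosis tree" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a tiny hand-written tree in which follow-up questions are asked only when the patient confirms a symptom. The `Node` class, the `patient_agent` and `synthesize` functions, the example tree, and the example case are all hypothetical; in a real system, both the doctor's questions and the patient's replies would come from LLM calls rather than the deterministic stand-ins used here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One inquiry topic in a hypothetical diagnosis tree."""
    symptom: str
    question: str
    # Follow-up topics, visited only if the patient confirms this symptom.
    children: list[Node] = field(default_factory=list)


def patient_agent(case: dict, node: Node) -> tuple[bool, str]:
    """Stand-in for an LLM patient agent: answers from the case record."""
    present = node.symptom in case["symptoms"]
    reply = case["symptoms"].get(node.symptom, "No, not really.")
    return present, reply


def synthesize(case: dict, roots: list[Node]) -> list[tuple[str, str]]:
    """Doctor agent walks the tree depth-first; the tree decides what is asked next."""
    dialogue = []
    stack = list(reversed(roots))
    while stack:
        node = stack.pop()
        dialogue.append(("doctor", node.question))
        present, reply = patient_agent(case, node)
        dialogue.append(("patient", reply))
        if present:
            # Symbolic control: expand follow-ups only for confirmed symptoms.
            stack.extend(reversed(node.children))
    return dialogue


# Hypothetical tree and case, for illustration only.
tree = [
    Node("low_mood", "Have you been feeling down lately?", [
        Node("insomnia", "How has your sleep been?"),
    ]),
    Node("anxiety", "Do you often feel nervous or on edge?"),
]
case = {
    "symptoms": {
        "low_mood": "Yes, for about two months.",
        "insomnia": "I wake up at 3 a.m. most nights.",
    }
}
dialogue = synthesize(case, tree)
for speaker, utterance in dialogue:
    print(f"{speaker}: {utterance}")
```

In this sketch the tree, not the language model, determines which topic comes next, so each conversation is guaranteed to cover the required inquiry topics while the wording of each turn is left to the agents.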
