Leveraging Evidence-Guided LLMs to Enhance Trustworthy Depression Diagnosis

Yining Yuan, J. Ben Tamo, Micky C. Nnamdi, Yifei Wang, May D. Wang

TL;DR

This work tackles the trust and interpretability gap in LLM-based depression diagnosis by proposing a two-stage framework: Evidence-Guided Diagnostic Reasoning (EGDR) and Diagnosis Confidence Scoring (DCS). EGDR grounds reasoning in a DSM-5 knowledge graph and interleaves evidence extraction with clinical criteria to produce structured diagnostic hypotheses, while DCS jointly evaluates factual alignment (KAS) and global logical consistency (LCS) to yield a normalized confidence score. Across the D4 and MDDial datasets, EGDR consistently improves diagnostic accuracy and confidence scores across diverse LLMs, demonstrating enhanced transparency and clinical fidelity. The approach offers a clinically grounded, interpretable foundation for trustworthy AI-assisted depression diagnosis, with potential for safer integration into clinical workflows.

Abstract

Large language models (LLMs) show promise in automating clinical diagnosis, yet their non-transparent decision-making and limited alignment with diagnostic standards hinder trust and clinical adoption. We address this challenge by proposing a two-stage diagnostic framework that enhances transparency, trustworthiness, and reliability. First, we introduce Evidence-Guided Diagnostic Reasoning (EGDR), which guides LLMs to generate structured diagnostic hypotheses by interleaving evidence extraction with logical reasoning grounded in DSM-5 criteria. Second, we propose a Diagnosis Confidence Scoring (DCS) module that evaluates the factual accuracy and logical consistency of generated diagnoses through two interpretable metrics: the Knowledge Attribution Score (KAS) and the Logic Consistency Score (LCS). Evaluated on the D4 dataset with pseudo-labels, EGDR outperforms direct in-context prompting and Chain-of-Thought (CoT) across five LLMs. For instance, on OpenBioLLM, EGDR improves accuracy from 0.31 (Direct) to 0.76 and increases DCS from 0.50 to 0.67. On MedLlama, DCS rises from 0.58 (CoT) to 0.77. Overall, EGDR yields up to +45% accuracy and +36% DCS gains over baseline methods, offering a clinically grounded, interpretable foundation for trustworthy AI-assisted diagnosis.

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Overview of the two-stage Evidence-Guided Diagnostic Reasoning (EGDR) and Diagnosis Confidence Scoring (DCS) framework. A DSM-5-based knowledge graph is constructed from diagnostic manuals using medical triplets. Stage 1 (EGDR) processes multi-turn patient dialogues to extract symptoms, retrieve relevant criteria, and generate diagnosis with reasoning. Stage 2 computes two evaluation scores: Knowledge Attribution Score (KAS) for semantic alignment with DSM-5 triplets, and Logic Consistency Score (LCS) for rule-based diagnostic validity. These are aggregated into a final Diagnosis Confidence Score (DCS) ranging from 0 to 1.
  • Figure 2: EGDR framework overview. An AI assistant (1) extracts patient symptoms from the doctor-patient dialogue; (2) matches the symptoms to the top-3 candidate disorders via the DSM-5 knowledge graph; (3) evaluates whether the diagnostic criteria are met; (4) checks the exclusion criteria; and (5) determines the final diagnosis and provides its reasoning.
  • Figure 3: Depression and suicidal risk prediction results. GPT-4o-mini performance approaches the D4 baseline.
  • Figure 4: Diagnosis Confidence Scores (DCS) by correctness under different prompting strategies.
  • Figure 5: DCS evaluation of diagnostic reasoning for an invalid MDD case. The top section shows a Knowledge Attribution Score (KAS) breakdown, revealing mostly weak alignment with DSM-5 diagnostic knowledge. The bottom section presents the logic consistency evaluation, which scores 0.0 due to missing core symptoms and an insufficient symptom count. Together, these produce a low Diagnosis Confidence Score (DCS = 0.291).
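
The captions for Figures 1 and 5 describe a rule-based logic check (core symptoms and symptom count) whose result is combined with KAS into a confidence score between 0 and 1. The Python sketch below illustrates that idea under stated assumptions; it is not the authors' implementation. The symptom labels, the function names, the binary 0/1 form of the logic score, and the weighted-average aggregation with `weight` are all hypothetical, since this page does not give the paper's actual formulas. The only rule taken as fact is DSM-5 Criterion A for major depressive disorder: at least five of nine symptoms, one of which must be depressed mood or loss of interest. Duration and exclusion criteria are deliberately left out.

```python
# Hypothetical short labels for the nine DSM-5 Criterion A symptoms of MDD.
CORE_SYMPTOMS = {"depressed_mood", "anhedonia"}
ALL_SYMPTOMS = CORE_SYMPTOMS | {
    "weight_or_appetite_change",
    "sleep_disturbance",
    "psychomotor_change",
    "fatigue",
    "worthlessness_or_guilt",
    "impaired_concentration",
    "suicidal_ideation",
}
MIN_SYMPTOM_COUNT = 5  # DSM-5: at least five of the nine symptoms


def logic_consistency_score(symptoms: set[str]) -> float:
    """Return 1.0 if the symptom set satisfies the count and core-symptom
    rules, else 0.0. The binary scale is an assumption of this sketch."""
    recognized = symptoms & ALL_SYMPTOMS  # ignore labels outside the nine
    has_core = bool(recognized & CORE_SYMPTOMS)
    has_enough = len(recognized) >= MIN_SYMPTOM_COUNT
    return 1.0 if (has_core and has_enough) else 0.0


def diagnosis_confidence_score(kas: float, lcs: float, weight: float = 0.5) -> float:
    """Combine KAS and LCS into one score in [0, 1].

    A convex combination is assumed here purely for illustration; the
    paper's real aggregation may differ.
    """
    for name, value in (("kas", kas), ("lcs", lcs), ("weight", weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return weight * kas + (1.0 - weight) * lcs


if __name__ == "__main__":
    # A case with too few symptoms and no core symptom fails the logic check.
    reported = {"fatigue", "sleep_disturbance"}
    lcs = logic_consistency_score(reported)
    print(lcs, diagnosis_confidence_score(kas=0.6, lcs=lcs))
```

With a failed logic check, the combined score can never exceed `weight`, which mirrors the qualitative behavior in Figure 5: weak knowledge alignment plus a logic score of 0.0 yields a low overall confidence.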