Leveraging Evidence-Guided LLMs to Enhance Trustworthy Depression Diagnosis

Yining Yuan; J. Ben Tamo; Micky C. Nnamdi; Yifei Wang; May D. Wang

TL;DR

This work tackles the trust and interpretability gap in LLM-based depression diagnosis by proposing a two-stage framework: Evidence-Guided Diagnostic Reasoning (EGDR) and Diagnosis Confidence Scoring (DCS). EGDR grounds reasoning in a DSM-5 knowledge graph and interleaves evidence extraction with clinical criteria to produce structured diagnostic hypotheses, while DCS jointly evaluates factual alignment (Knowledge Attribution Score, KAS) and global logical consistency (Logic Consistency Score, LCS) to yield a normalized confidence score. On the D4 and MDDial datasets, EGDR consistently improves diagnostic accuracy and confidence scores across diverse LLMs, demonstrating enhanced transparency and clinical fidelity. The approach offers a clinically grounded, interpretable foundation for trustworthy AI-assisted depression diagnosis, with potential for safer integration into clinical workflows.

Abstract

Large language models (LLMs) show promise in automating clinical diagnosis, yet their non-transparent decision-making and limited alignment with diagnostic standards hinder trust and clinical adoption. We address this challenge by proposing a two-stage diagnostic framework that enhances transparency, trustworthiness, and reliability. First, we introduce Evidence-Guided Diagnostic Reasoning (EGDR), which guides LLMs to generate structured diagnostic hypotheses by interleaving evidence extraction with logical reasoning grounded in DSM-5 criteria. Second, we propose a Diagnosis Confidence Scoring (DCS) module that evaluates the factual accuracy and logical consistency of generated diagnoses through two interpretable metrics: the Knowledge Attribution Score (KAS) and the Logic Consistency Score (LCS). Evaluated on the D4 dataset with pseudo-labels, EGDR outperforms direct in-context prompting and Chain-of-Thought (CoT) across five LLMs. For instance, on OpenBioLLM, EGDR improves accuracy from 0.31 (Direct) to 0.76 and increases DCS from 0.50 to 0.67. On MedLlama, DCS rises from 0.58 (CoT) to 0.77. Overall, EGDR yields up to +45% accuracy and +36% DCS gains over baseline methods, offering a clinically grounded, interpretable foundation for trustworthy AI-assisted diagnosis.
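The per-model figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper `gains` is a hypothetical name, and it assumes all scores lie on a 0-1 scale. The abstract does not state whether the headline percentages are absolute or relative, so both are computed.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain) for a score on a 0-1 scale.

    Hypothetical helper for illustration; not part of the paper's code.
    """
    absolute = after - before
    relative = absolute / before
    return round(absolute, 2), round(relative, 2)


# Figures quoted in the abstract.
acc_abs, acc_rel = gains(0.31, 0.76)  # OpenBioLLM accuracy, Direct vs. EGDR
dcs_abs, dcs_rel = gains(0.50, 0.67)  # OpenBioLLM DCS, Direct vs. EGDR
med_abs, med_rel = gains(0.58, 0.77)  # MedLlama DCS, CoT vs. EGDR

print(f"OpenBioLLM accuracy: +{acc_abs} absolute, +{acc_rel:.0%} relative")
print(f"OpenBioLLM DCS:      +{dcs_abs} absolute, +{dcs_rel:.0%} relative")
print(f"MedLlama DCS:        +{med_abs} absolute, +{med_rel:.0%} relative")
```

Under these assumptions, the OpenBioLLM accuracy gain is 0.45 in absolute terms, which matches the "+45%" headline if that figure is read as percentage points. Neither DCS pair quoted here reproduces the "+36%" figure, so that value presumably comes from a model or baseline not listed in the abstract.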
