CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis

Junying Chen, Chi Gui, Anningzhe Gao, Ke Ji, Xidong Wang, Xiang Wan, Benyou Wang

TL;DR

The paper tackles interpretability in LLM-based medical diagnosis by introducing Chain of Diagnosis (CoD), which outputs a transparent diagnostic chain and a disease confidence distribution to enable entropy-based symptom inquiry and controllable decisions. DiagnosisGPT is built by fine-tuning on 48,020 synthetic CoD cases generated from a knowledge base of 9,604 diseases; it can diagnose all 9,604 diseases and shows strong performance and interpretability on multiple benchmarks, including the new real-world dataset DxBench. The approach combines a disease retriever, confidence-driven decision making, and entropy-guided inquiries to improve both transparency and diagnostic rigor. The work provides a scalable framework for evaluating medical LLMs with open-ended consultations and offers a practical benchmark (DxBench) to simulate real-world clinical diagnostics.

Abstract

The field of medical diagnosis has undergone a significant transformation with the advent of large language models (LLMs), yet the challenges of interpretability within these models remain largely unaddressed. This study introduces Chain-of-Diagnosis (CoD) to enhance the interpretability of LLM-based medical diagnostics. CoD transforms the diagnostic process into a diagnostic chain that mirrors a physician's thought process, providing a transparent reasoning pathway. Additionally, CoD outputs the disease confidence distribution to ensure transparency in decision-making. This interpretability makes model diagnostics controllable and aids in identifying critical symptoms for inquiry through the entropy reduction of confidences. With CoD, we developed DiagnosisGPT, capable of diagnosing 9604 diseases. Experimental results demonstrate that DiagnosisGPT outperforms other LLMs on diagnostic benchmarks. Moreover, DiagnosisGPT provides interpretability while ensuring controllability in diagnostic rigor.
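The abstract's two mechanisms, a confidence-controlled decision and symptom selection by entropy reduction, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the function names (`entropy`, `decide`, `pick_symptom`), the toy diseases and symptoms, the numeric confidences, and the idea that a post-inquiry confidence distribution is available for each candidate symptom.

```python
import math

def entropy(confidences):
    """Shannon entropy (in nats) of a disease confidence distribution."""
    return -sum(p * math.log(p) for p in confidences.values() if p > 0)

def decide(confidences, tau):
    """Diagnose only if the top confidence exceeds the threshold tau;
    otherwise signal that another symptom should be asked about."""
    disease, top = max(confidences.items(), key=lambda kv: kv[1])
    return ("diagnose", disease) if top > tau else ("inquire", None)

def pick_symptom(confidences, predicted):
    """Pick the candidate symptom with the largest entropy reduction.
    `predicted` maps each symptom to a hypothetical post-inquiry distribution."""
    base = entropy(confidences)
    return max(predicted, key=lambda s: base - entropy(predicted[s]))

# Hypothetical toy data, for illustration only.
current = {"flu": 0.4, "cold": 0.4, "allergy": 0.2}
predicted = {
    "fever": {"flu": 0.8, "cold": 0.1, "allergy": 0.1},
    "sneezing": {"flu": 0.35, "cold": 0.35, "allergy": 0.3},
}

action = decide(current, tau=0.5)              # no confidence exceeds 0.5
next_symptom = pick_symptom(current, predicted)
```

In this toy case no disease exceeds the threshold, so the sketch chooses to inquire, and it selects the symptom whose predicted distribution is most concentrated. Raising `tau` makes the sketch ask more questions before committing, which is one way to read the abstract's "controllability in diagnostic rigor".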

Paper Structure (47 sections, 10 equations, 23 figures, 9 tables)

This paper contains 47 sections, 10 equations, 23 figures, 9 tables.

Figures (23)

  • Figure 1: Example of the automatic diagnosis task, with sample data from midiag.
  • Figure 2: Left: Example of a CoD response. Right: Construction of CoD training data.
  • Figure 3: Schematic of constructing disease database and synthesizing patient cases.
  • Figure 4: Relationship between confidence and accuracy. We provided all symptoms ($\mathcal{S}_{\textbf{exp}}\cup \mathcal{S}_{\textbf{imp}}$) to DiagnosisGPT for direct disease diagnosis (without symptom inquiry). Diagnosis Accuracy is the accuracy of the diagnoses whose confidence exceeds the threshold $\tau$. Diagnosis Rate is the proportion of cases whose confidence exceeds $\tau$, i.e., the proportion of cases in which the model makes a diagnosis.
  • Figure 5: Evaluation results of completeness. Disease Completeness denotes the percentage of analyses covering all diseases. Symptom Completeness denotes the percentage covering all patient symptoms. Left: We sampled 2k entries from CoD data with varied prompt-driven analyses evaluated by GPT-4. Right: We sampled 100 entries and conducted manual evaluations. See Appendix \ref{COD_COT_EVALUATION} for details.
  • ...and 18 more figures
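
The two quantities defined in the Figure 4 caption, Diagnosis Rate and Diagnosis Accuracy at a threshold $\tau$, can be computed as in the following Python sketch. The function name `rate_and_accuracy`, the `(top_confidence, is_correct)` case format, and the sample values are assumptions made for illustration; they are not results or code from the paper.

```python
def rate_and_accuracy(cases, tau):
    """cases: list of (top_confidence, is_correct) pairs, one per patient case.
    Returns (diagnosis_rate, diagnosis_accuracy) at threshold tau.
    Accuracy is None when no case exceeds the threshold."""
    decided = [correct for conf, correct in cases if conf > tau]
    rate = len(decided) / len(cases) if cases else 0.0
    accuracy = sum(decided) / len(decided) if decided else None
    return rate, accuracy

# Hypothetical cases, for illustration only.
cases = [(0.9, True), (0.7, True), (0.6, False), (0.3, False)]

low = rate_and_accuracy(cases, tau=0.5)    # three of four cases are diagnosed
high = rate_and_accuracy(cases, tau=0.8)   # only the most confident case is diagnosed
```

In this toy data a higher threshold diagnoses fewer cases but gets a larger share of them right, which is the trade-off the figure is described as plotting.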