CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis

Junying Chen; Chi Gui; Anningzhe Gao; Ke Ji; Xidong Wang; Xiang Wan; Benyou Wang

TL;DR

The paper tackles interpretability in LLM-based medical diagnosis by introducing Chain of Diagnosis (CoD), which outputs a transparent diagnostic chain together with a disease confidence distribution, enabling entropy-based symptom inquiry and controllable decisions. The authors build DiagnosisGPT by fine-tuning on 48,020 synthetic CoD cases generated from a knowledge base of 9,604 diseases, all of which the model can diagnose. DiagnosisGPT demonstrates strong performance and interpretability on multiple benchmarks, including the new real-world dataset DxBench. The approach combines a disease retriever, confidence-driven decision making, and entropy-guided inquiries to improve both transparency and diagnostic rigor. The work also provides a scalable framework for evaluating medical LLMs in open-ended consultations, with DxBench serving as a practical benchmark that simulates real-world clinical diagnosis.

Abstract

The field of medical diagnosis has undergone a significant transformation with the advent of large language models (LLMs), yet the challenges of interpretability within these models remain largely unaddressed. This study introduces Chain-of-Diagnosis (CoD) to enhance the interpretability of LLM-based medical diagnostics. CoD transforms the diagnostic process into a diagnostic chain that mirrors a physician's thought process, providing a transparent reasoning pathway. Additionally, CoD outputs the disease confidence distribution to ensure transparency in decision-making. This interpretability makes model diagnostics controllable and aids in identifying critical symptoms for inquiry through the entropy reduction of confidences. With CoD, we developed DiagnosisGPT, capable of diagnosing 9604 diseases. Experimental results demonstrate that DiagnosisGPT outperforms other LLMs on diagnostic benchmarks. Moreover, DiagnosisGPT provides interpretability while ensuring controllability in diagnostic rigor.
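
The abstract describes two uses of the disease confidence distribution: controlling when a diagnosis is made, and choosing which symptom to ask about by the entropy reduction of the confidences. The sketch below is one plausible reading of that idea, not the authors' implementation. The function names, the threshold parameter `tau`, the disease and symptom labels, and the `projected` distributions (the confidences assumed to result if a symptom were confirmed) are all hypothetical values chosen for illustration.

```python
import math


def entropy(confidences):
    """Shannon entropy (in nats) of a disease confidence distribution."""
    return -sum(p * math.log(p) for p in confidences.values() if p > 0)


def decide(confidences, tau=0.5):
    """Diagnose only if the top confidence exceeds tau; otherwise keep inquiring."""
    disease, p = max(confidences.items(), key=lambda kv: kv[1])
    return ("diagnose", disease) if p > tau else ("inquire", None)


def pick_symptom(current, projected):
    """Pick the candidate symptom with the largest entropy reduction.

    `projected` maps each candidate symptom to the confidence distribution
    assumed to hold after that symptom is confirmed (an illustrative
    simplification; the paper's actual estimate may differ).
    """
    h0 = entropy(current)
    return max(projected, key=lambda s: h0 - entropy(projected[s]))


# Hypothetical confidences over three candidate diseases.
current = {"flu": 0.40, "cold": 0.35, "covid": 0.25}

# Hypothetical confidences after confirming each candidate symptom.
projected = {
    "fever": {"flu": 0.70, "cold": 0.10, "covid": 0.20},
    "sneezing": {"flu": 0.35, "cold": 0.40, "covid": 0.25},
}

action, _ = decide(current, tau=0.5)        # top confidence 0.40 is below tau
next_symptom = pick_symptom(current, projected)
```

With these made-up numbers, no disease clears the threshold, so the agent would keep inquiring, and asking about "fever" sharpens the distribution more than asking about "sneezing". Raising `tau` makes the agent more conservative: it asks more questions before committing to a diagnosis.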

Paper Structure (47 sections, 10 equations, 23 figures, 9 tables)

Introduction
Preliminaries
Problem definition
The Challenge for LLM
The Philosophy of CoD for Interpretability
Methodology: Chain of Diagnosis
The Diagnostic Chain
CoD as an Entropy-reduction Process
Experiments
Model Training & Setup
Benchmarking Settings
Traditional baselines (Non-LLM)
LLM baselines
LLM Evaluation
Benchmarks
...and 32 more sections

Figures (23)

Figure 1: Example of the automatic diagnosis task, with sample data from midiag.
Figure 2: Left: Example of a CoD response. Right: Construction of CoD training data.
Figure 3: Schematic of constructing disease database and synthesizing patient cases.
Figure 4: Relationship between confidence and accuracy. We provided all symptoms ($\mathcal{S}_{\textbf{exp}}\cup \mathcal{S}_{\textbf{imp}}$) to DiagnosisGPT for direct disease diagnosis (without symptom inquiry). Diagnosis Accuracy is the accuracy of diagnoses whose confidence exceeds the threshold $\tau$. Diagnosis Rate is the proportion of cases whose confidence exceeds $\tau$, i.e., the proportion of cases in which the model makes a diagnosis.
Figure 5: Evaluation results of completeness. Disease Completeness denotes the percentage of analyses covering all diseases. Symptom Completeness denotes the percentage covering all patient symptoms. Left: We sampled 2k entries from CoD data with varied prompt-driven analyses evaluated by GPT-4. Right: We sampled 100 entries and conducted manual evaluations. See the appendix for details.
...and 18 more figures
