Large Language Models are Clinical Reasoners: Reasoning-Aware Diagnosis Framework with Prompt-Generated Rationales

Taeyoon Kwon, Kai Tzu-iunn Ong, Dongjin Kang, Seungjun Moon, Jeong Ryong Lee, Dosik Hwang, Yongsik Sim, Beomseok Sohn, Dongha Lee, Jinyoung Yeo

TL;DR

This work addresses the under-explored problem of clinical reasoning for disease diagnosis by introducing a reasoning-aware diagnosis framework that uses prompt-generated Clinical Chain-of-Thought (Clinical CoT) rationales to guide diagnosis, demonstrated on Alzheimer's disease data. It combines a clinical rationalization module based on few-shot CoT prompting with distillation of the generated rationales into unimodal and multimodal student models, targeting accurate and data-efficient diagnosis. Human evaluation with radiologists validates the quality and clinical relevance of the generated rationales, and the reported results show that distilled student models can outperform the teacher LLM in many settings. The framework is positioned as a step toward practical, reasoning-enabled AI for clinical diagnostics with improved transparency and efficiency.

Abstract

Machine reasoning has made great progress in recent years owing to large language models (LLMs). In the clinical domain, however, most NLP-driven projects mainly focus on clinical classification or reading comprehension, and under-explore clinical reasoning for disease diagnosis due to the expensive rationale annotation with clinicians. In this work, we present a "reasoning-aware" diagnosis framework that rationalizes the diagnostic process via prompt-based learning in a time- and labor-efficient manner, and learns to reason over the prompt-generated rationales. Specifically, we address the clinical reasoning for disease diagnosis, where the LLM generates diagnostic rationales providing its insight on presented patient data and the reasoning path towards the diagnosis, namely Clinical Chain-of-Thought (Clinical CoT). We empirically demonstrate LLMs/LMs' ability of clinical reasoning via extensive experiments and analyses on both rationale generation and disease diagnosis in various settings. We further propose a novel set of criteria for evaluating machine-generated rationales' potential for real-world clinical settings, facilitating and benefiting future research in this area.
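
To make the data flow described in the abstract concrete, here is a minimal sketch of how a few-shot Clinical CoT prompt and a student training pair might be assembled. Everything in it is an assumption made for illustration: the `Example` class, the function names, the "Patient / Rationale / Diagnosis" template, and the placeholder strings are hypothetical and are not taken from the paper, whose actual prompts and training format are not shown here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Example:
    """One case; field names are hypothetical, chosen to mirror P, R, D."""
    patient: str    # P: patient description
    rationale: str  # R: clinical rationale (Clinical CoT)
    diagnosis: str  # D: diagnosis


def build_few_shot_prompt(demos: List[Example], patient: str) -> str:
    """Assemble a few-shot prompt (assumed template, not the paper's).

    Each demonstration shows patient -> rationale -> diagnosis; the query
    ends at "Rationale:" so the LLM continues with its own reasoning.
    """
    parts = [
        f"Patient: {d.patient}\nRationale: {d.rationale}\nDiagnosis: {d.diagnosis}"
        for d in demos
    ]
    parts.append(f"Patient: {patient}\nRationale:")
    return "\n\n".join(parts)


def to_student_example(ex: Example) -> Dict[str, str]:
    """Turn a teacher-rationalized case into a source/target pair.

    Assumption: the student is trained to emit the rationale followed by
    the diagnosis, given only the patient description.
    """
    return {
        "source": f"Patient: {ex.patient}",
        "target": f"Rationale: {ex.rationale}\nDiagnosis: {ex.diagnosis}",
    }


if __name__ == "__main__":
    # Placeholder strings only; no real patient data.
    demo = Example("<patient description 1>", "<rationale 1>", "<diagnosis 1>")
    print(build_few_shot_prompt([demo], "<patient description 2>"))
    print(to_student_example(demo))
```

In this sketch the teacher LLM's completion of the prompt would supply the rationale that `to_student_example` later packs into the target. Whether the student should generate the rationale before or after the diagnosis is a design choice that the summary above does not settle.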

Paper Structure

This paper contains 51 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Clinical reasoning in disease diagnosis.
  • Figure 2: An overview of our framework ($\mathcal{P}$: Patient description; $\mathcal{D}$: Diagnosis; $\mathcal{R}$: Clinical rationale).
  • Figure 3: Performance of student models trained with and without clinical rationales, reported on ADNI. The dotted line is the performance of the teacher LLM (GPT-4).
  • Figure 4: Data efficiency brought by clinical reasoning.
  • Figure 5: Analysis of rationales from GPT-4's misdiagnoses.
  • ...and 1 more figure