MedCoT: Medical Chain of Thought via Hierarchical Expert

Jiaxiang Liu, Yuan Wang, Jiawei Du, Joey Tianyi Zhou, Zuozhu Liu

TL;DR

MedCoT tackles the lack of interpretable reasoning and robustness in Med-VQA by introducing a hierarchical expert verification pipeline that cascades through an Initial Specialist, a Follow-up Specialist, and a Diagnostic Specialist empowered by a sparse MoE. The approach couples step-by-step multimodal reasoning with a multimodal T5 backbone to produce not only answers but justifications, validated through expert voting to improve accuracy. Empirical results on four standard Med-VQA datasets show state-of-the-art performance and enhanced interpretability, with significant gains over strong baselines and clear reasoning traces. This work enhances clinical trust and diagnostic reliability by embedding explicit reasoning paths and multi-expert consensus into medical visual question answering.

Abstract

Artificial intelligence has advanced in Medical Visual Question Answering (Med-VQA), but prevailing research tends to focus on answer accuracy, often overlooking the reasoning paths and interpretability that are crucial in clinical settings. In addition, current Med-VQA algorithms typically rely on a single model and lack the robustness needed for real-world medical diagnostics, which usually require collaborative expert evaluation. To address these shortcomings, this paper presents MedCoT, a novel hierarchical expert verification reasoning chain method designed to enhance interpretability and accuracy in biomedical imaging inquiries. MedCoT is predicated on two principles: the necessity of explicit reasoning paths in Med-VQA and the requirement for multi-expert review to formulate accurate conclusions. The methodology involves an Initial Specialist that proposes diagnostic rationales, followed by a Follow-up Specialist that validates these rationales. Finally, a consensus is reached through a vote among a sparse Mixture of Experts within the locally deployed Diagnostic Specialist, which then provides the definitive diagnosis. Experimental evaluations on four standard Med-VQA datasets demonstrate that MedCoT surpasses existing state-of-the-art approaches, providing significant improvements in performance and interpretability.
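
The three-stage flow described in the abstract can be sketched as plain control flow. This is a minimal illustration, not the authors' implementation: the names `Review`, `run_medcot`, and the `toy_*` functions are hypothetical, the specialists are stand-in callables rather than real models, and the assumption that the Follow-up Specialist always returns a caption is ours.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Review:
    """Hypothetical container for the Follow-up Specialist's output."""
    effective: bool                    # verdict on the initial rationale
    revised_rationale: Optional[str]   # set only when the verdict is negative
    caption: str                       # image caption (assumed always produced)


def run_medcot(
    question: str,
    image: str,
    initial: Callable[[str, str], str],
    follow_up: Callable[[str, str, str], Review],
    diagnostic: Callable[[str, str, str, str], str],
) -> Tuple[str, str]:
    """Cascade: propose a rationale, verify or revise it, then diagnose."""
    rationale = initial(question, image)
    review = follow_up(question, image, rationale)
    # Keep the initial rationale if it was judged effective; otherwise swap in
    # the revised one.
    if not review.effective and review.revised_rationale is not None:
        rationale = review.revised_rationale
    answer = diagnostic(question, image, rationale, review.caption)
    return answer, rationale


# Toy stand-ins so the sketch runs end to end. The "image" is just a string tag.
def toy_initial(question: str, image: str) -> str:
    return "The lungs look clear, so there is no abnormality."


def toy_follow_up(question: str, image: str, rationale: str) -> Review:
    if "clear" in rationale and "opacity" in image:
        return Review(
            effective=False,
            revised_rationale="An opacity is visible in the right lung, "
                              "so an abnormality is present.",
            caption="Chest X-ray with right-lung opacity.",
        )
    return Review(effective=True, revised_rationale=None, caption="Chest X-ray.")


def toy_diagnostic(question: str, image: str, rationale: str, caption: str) -> str:
    return "no" if "no abnormality" in rationale else "yes"


answer, rationale = run_medcot(
    "Is there an abnormality?", "xray-with-opacity",
    toy_initial, toy_follow_up, toy_diagnostic,
)
print(answer)
print(rationale)
```

Passing the specialists in as callables keeps the cascade itself independent of which models fill each role, and the function returns the rationale alongside the answer so the reasoning path stays inspectable.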

Paper Structure

This paper contains 18 sections, 6 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: The upper figure compares the outputs of a previous Med-VQA method with those of MedCoT, and the previous technique in MMCoT (Zhang et al., 2023) with the Sparse MoE in MedCoT. The lower figure demonstrates that MedCoT, with a model size of 256M parameters, outperforms the 7B-parameter LLaVA-Med in accuracy by 5.52% and 4.09% on the VQA-RAD and SLAKE-EN datasets, respectively.
  • Figure 2: The MedCoT pipeline begins with an Initial Specialist receiving a medical question and image to generate a preliminary rationale. This rationale may have flaws (indicated in red), which are then reviewed by the Follow-up Specialist. If the rationale is deemed effective, it is retained; otherwise, it is reconsidered and a new rationale (indicated in green) is generated, along with an image caption. These elements are then integrated into the Diagnostic Specialist. Informed by all contexts, the Diagnostic Specialist, a multimodal language model with a designed sparse MoE structure, delivers the final diagnostic outcome (answer).
  • Figure 3: Diagnostic Specialist Pipeline. After passing through a visual encoder, medical images yield visual features. Contextual textual information—including captions, rationales, and options—is processed by a text encoder to obtain textual features. These are then subjected to cross-attention for feature integration, producing combined features. These integrated features, along with textual features, are input into a Sparse MoE structure. Here, multiple specialized experts thoroughly understand the intents of both the image and text. The insights are then fed into a textual decoder, which decodes the information to produce the final answer.
  • Figure 4: MedCoT is compared with various SoTA methods on closed questions on the VQA-RAD and SLAKE-EN datasets. MedCoT not only achieves SoTA accuracy in answers but also provides reasoning paths (rationale). The metric used is Accuracy (%).
  • Figure 5: The MedCoT pipeline begins with an Initial Specialist receiving a medical question and image to generate a preliminary rationale. This rationale may have flaws (indicated in red), which are then reviewed by the Follow-up Specialist. If the rationale is deemed effective, it is retained; otherwise, it is reconsidered and a new rationale (indicated in green) is generated, along with an image caption. These elements are then integrated into the Diagnostic Specialist. Informed by all context, the Diagnostic Specialist, a multimodal language model with a designed sparse MoE structure, delivers the final diagnostic outcome (answer).
  • ...and 2 more figures
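
The sparse MoE step in the Figure 3 caption can be illustrated with a small routing sketch. The captions above do not specify the gating function, so this assumes a common formulation: a linear gate scores each expert, only the top-k experts run, and their outputs are mixed with softmax weights. All names (`sparse_moe`, `gate_weights`, the toy experts) are hypothetical, and plain Python lists stand in for feature tensors.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_moe(
    x: Vector,
    gate_weights: Sequence[Vector],
    experts: Sequence[Callable[[Vector], Vector]],
    k: int = 2,
) -> Tuple[Vector, List[int]]:
    """Route x to the k highest-scoring experts and mix their outputs."""
    # One gate logit per expert: dot product of the gate row with the input.
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in gate_weights]
    # Sparse selection: keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalise over the selected experts only.
    weights = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)
        out = [o + w * y_j for o, y_j in zip(out, y)]
    return out, top


# Toy experts: each one just rescales the input.
def expert_identity(v: Vector) -> Vector:
    return list(v)


def expert_double(v: Vector) -> Vector:
    return [2.0 * a for a in v]


def expert_triple(v: Vector) -> Vector:
    return [3.0 * a for a in v]


features = [1.0, 2.0]
gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # gate logits: 1, 2, -3
mixed, chosen = sparse_moe(
    features, gate, [expert_identity, expert_double, expert_triple], k=2
)
print(chosen)
print(mixed)
```

With these toy values the third expert is never evaluated, which is the point of sparse routing: compute scales with k rather than with the total number of experts.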