Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering

Lin Fan, Yafei Ou, Zhipeng Deng, Pengyu Dai, Hou Chongxian, Jiale Yan, Yaqian Li, Kaiwen Long, Xun Gong, Masayuki Ikebe, Yefeng Zheng

Abstract

Chain-of-thought (CoT) reasoning has advanced medical visual question answering (VQA), yet most existing CoT rationales are free-form and fail to capture the structured reasoning process clinicians actually follow. This work asks: Can traceable, multi-step reasoning supervision improve reasoning accuracy and the interpretability of Medical VQA? To this end, we introduce Step-CoT, a large-scale medical reasoning dataset with expert-curated, structured multi-step CoT aligned to clinical diagnostic workflows, implicitly grounding the model's reasoning in radiographic evidence. Step-CoT comprises more than 10K real clinical cases and 70K VQA pairs organized around diagnostic workflows, providing supervised intermediate steps that guide models to follow valid reasoning trajectories. To effectively learn from Step-CoT, we further introduce a teacher-student framework with a dynamic graph-structured focusing mechanism that prioritizes diagnostically informative steps while filtering out less relevant contexts. Our experiments show that using Step-CoT can improve reasoning accuracy and interpretability. Benchmark: github.com/hahaha111111/Step-CoT. Dataset Card: huggingface.co/datasets/fl-15o/Step-CoT

Paper Structure (52 sections, 25 equations, 6 figures, 11 tables)

Figures (6)

  • Figure 1: Overview of the Step-CoT dataset. (A) Conventional Med-VQA approaches, in which models take an image and a question as input, perform multi-modal feature fusion, and output a diagnostic answer. Although it leverages multi-modal knowledge, this paradigm lacks interpretability and often yields limited diagnostic accuracy. (B) CoT-based approaches, which enhance interpretability by integrating large language models with CoT reasoning to generate intermediate explanations; however, such reasoning is often unreliable. (C) Our proposed Step-CoT dataset and training framework, which introduces explicit intermediate supervision. By guiding the model to learn structured clinical reasoning steps, Step-CoT not only improves interpretability through trustworthy intermediate reasoning but also enhances diagnostic accuracy.
  • Figure 2: Distribution and statistics for the data sources, disease prevalence, answer distributions, and reasoning lengths in the Step-CoT dataset. (A) The inner ring shows the proportional distribution across the source datasets, while the outer ring shows the distribution of disease categories within each dataset. (B) This confusion matrix, organized by disease categories and reasoning steps, visualizes the average reasoning chain length. Each cell contains a pie chart showing the distribution of samples across chain lengths, while the marginal histograms on the axes display the sample counts by chain length for individual steps (x-axis) and disease categories (y-axis). (C) This diagram presents the outcome transition statistics between consecutive reasoning steps, mapping the flow of diagnostic conclusions along the clinical reasoning pathway. The annotation names (e.g., A1) are defined in the dataset description in Appendix Sec. B.
  • Figure 3: Fine-tuning an LVLM with Step-CoT-based intermediate constraints under verifiable instructions. Each step is tested in an independent dialogue session. The model progressively adjusts its stepwise reasoning, produces coherent intermediate steps, and converges to the correct final diagnosis, demonstrating that Step-CoT’s structured intermediate constraints strengthen model reasoning and reliably guide it to accurate conclusions.
  • Figure 4: Feature attention visualization across multi-step reasoning. Attention evolves from broad in the initial query steps to highly targeted in the final diagnostic step, reflecting the multi-step capability of Step-CoT and visually verifying the effectiveness of the reasoning chain.
  • Figure 5: This study collected a total of 16,782 CXR samples in PNG format from three datasets, containing 3,999, 8,788, and 3,995 samples, respectively. After filtering, 10,068 samples were retained, yielding 10,068 × 7 = 70,476 QA pairs for training the stepwise Med-VQA task.
  • ...and 1 more figures
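
The sample counts quoted in the Figure 5 caption can be checked with a short calculation. The following is a minimal sketch, not part of the paper's code: the variable names are illustrative, and the factor of 7 is read directly from the caption as the number of QA pairs per retained sample.

```python
# Per-source CXR sample counts as stated in the Figure 5 caption.
source_counts = [3999, 8788, 3995]
total_collected = sum(source_counts)

# Samples retained after filtering, per the caption.
retained = 10068

# Assumption: the "*7" in the caption means 7 QA pairs per retained sample.
qa_pairs_per_sample = 7
qa_pairs = retained * qa_pairs_per_sample

# Fraction of collected samples that survive filtering.
retention_rate = retained / total_collected

print(f"collected: {total_collected}")
print(f"QA pairs:  {qa_pairs}")
print(f"retained:  {retention_rate:.1%}")
```

The total of 70,476 QA pairs is consistent with the "70K VQA pairs" stated in the abstract.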