MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models

Soo Yong Kim, Suin Cho, Vincent-Daniel Yun, Gyeongyeon Hwang

TL;DR

MedCLM addresses the need for clinically interpretable AI in medical imaging by automatically generating large-scale VQA data enriched with Chain-of-Thought rationales. It links each lesion to its host organ to create anatomically grounded seeds and uses a three-stage Integrated CoT–Curriculum (Easy, Medium, Hard) to progressively ground visual evidence before performing reasoning under weak supervision. The approach yields state-of-the-art results on open-ended medical VQA benchmarks and improves radiology report generation, while maintaining interpretability through anatomically anchored CoT. This scalable pipeline reduces annotation bottlenecks and aligns medical vision-language models with clinical workflows, enabling robust, explainable diagnostic reasoning at scale.

Abstract

Bridging clinical diagnostic reasoning with AI remains a central challenge in medical imaging. We introduce MedCLM, an automated pipeline that converts detection datasets into large-scale medical visual question answering (VQA) data with Chain-of-Thought (CoT) reasoning by linking lesion boxes to organ segmentation and structured rationales. These contextual signals enable medical vision-language models to generate question-answer pairs with step-by-step reasoning. To utilize this data effectively, we propose an Integrated CoT-Curriculum Strategy composed of an Easy stage with explicit lesion boxes for visual grounding, a Medium stage that encourages implicit localization, and a Hard stage for weakly supervised reasoning. Experimental results demonstrate that MedCLM attains state-of-the-art performance on several medical VQA benchmarks, providing a scalable framework for developing clinically aligned medical vision-language models.
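
The abstract does not specify how a lesion box is matched to an organ, so the following is only a minimal sketch of one plausible rule: assign the lesion to the organ label that covers the most pixels inside its bounding box. The function names `link_lesion_to_organ` and `make_seed`, the label-map layout, and the seed fields are all illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Optional, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max), max edges exclusive.
Box = Tuple[int, int, int, int]


def link_lesion_to_organ(
    box: Box,
    organ_mask: List[List[int]],
    organ_names: Dict[int, str],
) -> Optional[str]:
    """Return the organ whose segmentation covers most of the lesion box.

    organ_mask is a 2D label map (0 = background); this majority-overlap
    rule is a hypothetical stand-in for the paper's linking step.
    """
    x0, y0, x1, y1 = box
    counts: Dict[int, int] = {}
    for row in organ_mask[y0:y1]:
        for label in row[x0:x1]:
            if label != 0:
                counts[label] = counts.get(label, 0) + 1
    if not counts:
        return None  # the box overlaps no segmented organ
    best_label = max(counts, key=lambda k: counts[k])
    return organ_names.get(best_label)


def make_seed(lesion_type: str, box: Box, organ: Optional[str]) -> Dict[str, object]:
    """Bundle the fields a rationale seed might carry (hypothetical schema)."""
    return {"lesion": lesion_type, "organ": organ, "box": box}
```

A seed built this way could then be passed to a vision-language model to synthesize question-answer pairs with step-by-step rationales; how the seed is actually serialized into a prompt is not stated in this section.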

Paper Structure

This paper contains 37 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Automated Rationale-to-CoT Data Generation and Curriculum Fine-Tuning. Top: Detection datasets are converted into a VQA-CoT corpus via organ segmentation, rationale seed generation, and CoT-based QA synthesis. Bottom: Fine-tuning progresses from Explicit Localization (Easy), to Implicit Localization (Medium), and finally to Weakly-Supervised Reasoning (Hard), reducing cognitive load and improving visual grounding.
  • Figure 2: Qualitative comparison of model outputs on binary and descriptive medical VQA tasks. The first two rows show binary QA cases with and without explicit box references, where our method correctly identifies pathology while baselines fail in at least one instance. The third row shows a free-form description task on a chest X-ray: our model produces a clinically faithful report aligned with the reference, whereas LLaVA-Med++ introduces extraneous findings and MedVP-LLaVA omits key stability details.
  • Figure 3: Additional qualitative results (1/3).
  • Figure 4: Additional qualitative results (2/3).
  • Figure 5: Additional qualitative results (3/3).