MedVLThinker: Simple Baselines for Multimodal Medical Reasoning

Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou

TL;DR

This paper presents MedVLThinker, a suite of simple yet strong baselines for building reasoning-centric medical LMMs, and establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs.

Abstract

Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to "think before responding" via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counter-intuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
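
To make "verifiable rewards based on final answer correctness" concrete, the following is a minimal sketch of such a reward for multiple-choice questions. It is not the authors' implementation: the `<answer>...</answer>` output format, the function names `extract_choice` and `correctness_reward`, and the binary 1.0/0.0 reward values are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed (hypothetical) output format: the model wraps its final choice
# in <answer>...</answer> tags after its reasoning.
_ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)
# An option letter, optionally parenthesized, followed by a delimiter or end.
_CHOICE = re.compile(r"\(?([A-Za-z])\)?(?=[\s.:)]|$)")


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter from the last <answer> tag, or None."""
    tagged = _ANSWER_TAG.findall(response)
    if not tagged:
        return None
    match = _CHOICE.match(tagged[-1].strip())
    return match.group(1).upper() if match else None


def correctness_reward(response: str, gold: str) -> float:
    """Binary reward: 1.0 if the extracted choice equals the gold letter."""
    choice = extract_choice(response)
    return 1.0 if choice is not None and choice == gold.strip().upper() else 0.0


if __name__ == "__main__":
    sample = "<think>Findings suggest consolidation.</think><answer>B</answer>"
    print(correctness_reward(sample, "B"))
```

Because the reward depends only on the final answer, it can be checked automatically without a learned reward model; a response with no parseable answer simply receives zero reward in this sketch.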

Paper Structure

This paper contains 28 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: MedVLThinker provides a simple yet strong baseline for multimodal medical reasoning. Notably, MedVLThinker-32B yields performance on par with the closed-source GPT-4o model.
  • Figure 2: The data filtering and training pipeline. (A) We first filter both the text-only m23k dataset and the image-text PMC-VQA dataset by generating multiple answers per question with Qwen2.5-VL-Instruct. We then filter out questions that the model answers entirely incorrectly or almost always correctly. (B) Using the two filtered datasets, we conduct supervised fine-tuning (SFT), reinforcement learning with verifiable rewards (RLVR), and their combination to train a family of multimodal medical large reasoning models.
  • Figure 3: Probing question difficulty with Qwen2.5-VL-Instruct. For each question, we generate 16 answers and draw pie plots of the pass count. As the scale of the multimodal LLM increases, the number of questions with a high pass count increases. This indicates the potential of the models, especially for later RLVR training, which encourages the models to increase their probability of answering questions correctly. The pass counts are used for later data filtering.
  • Figure 4: Case study on multiple medical VQA benchmarks with our 32B text-only RLVR model. Our MedVLThinker demonstrates robust reasoning capability across various imaging modalities.
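
The pass-count filtering described in the captions of Figures 2 and 3 can be sketched as follows. This is an illustrative outline only: the record layout (`sampled_answers`, `gold`), the function names, and the `min_pass`/`max_pass` thresholds are hypothetical, since the captions state only that 16 answers are sampled per question and that questions answered all wrong or almost always correctly are filtered.

```python
from typing import Dict, List

N_SAMPLES = 16  # answers generated per question, as stated for Figure 3


def pass_count(sampled_answers: List[str], gold: str) -> int:
    """Count how many sampled answers match the gold answer letter."""
    target = gold.strip().upper()
    return sum(1 for ans in sampled_answers if ans.strip().upper() == target)


def filter_by_difficulty(
    records: List[Dict],
    min_pass: int = 1,   # hypothetical: drops questions with zero passes
    max_pass: int = 12,  # hypothetical cutoff for "almost always correct"
) -> List[Dict]:
    """Keep questions whose pass count lies in [min_pass, max_pass]."""
    kept = []
    for rec in records:
        count = pass_count(rec["sampled_answers"], rec["gold"])
        if min_pass <= count <= max_pass:
            kept.append({**rec, "pass_count": count})
    return kept


if __name__ == "__main__":
    records = [
        {"id": "q1", "gold": "A", "sampled_answers": ["B"] * N_SAMPLES},
        {"id": "q2", "gold": "C", "sampled_answers": ["C"] * N_SAMPLES},
        {"id": "q3", "gold": "D", "sampled_answers": ["D"] * 5 + ["A"] * 11},
    ]
    for rec in filter_by_difficulty(records):
        print(rec["id"], rec["pass_count"])
```

One plausible reason for dropping both extremes is that, under a correctness-based reward, a question that is always missed or almost always solved yields nearly identical rewards across samples and therefore little training signal; the exact cutoffs used in the paper are not given in this section.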