OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning

Timothy Ossowski, Sheng Zhang, Qianchu Liu, Guanghui Qin, Reuben Tan, Tristan Naumann, Junjie Hu, Hoifung Poon

TL;DR

OctoMed investigates data-centric strategies for robust multimodal medical reasoning and introduces a structured data recipe designed for large-scale SFT. By distilling reasoning traces from a strong teacher and using rejection sampling, OctoMed builds a dataset of over 8 million reasoning traces that spans text and medical images. The model achieves state-of-the-art open-source performance across diverse benchmarks and exhibits emergent task-aware reasoning, dynamically adjusting trace depth by task difficulty. The work underscores the central role of data design in medical vision-language systems and outlines future steps toward reinforcement-learning–augmented robustness.

Abstract

High-quality and carefully curated data is a cornerstone of training medical large language models, as it directly impacts both generalization and robustness to unseen clinical tasks. We investigate strategies for training and data curation to develop a robust multimodal reasoning model in the medical domain. Our work focuses on supervised fine-tuning (SFT) and explores data recipes that leverage structured reasoning traces. Using our proposed data recipe, we scale experiments to a dataset of over 8 million examples and 6.8 billion response tokens, achieving state-of-the-art performance among open-source models across diverse out-of-distribution medical benchmark tasks. Our results further indicate that curating a high-quality, diverse training dataset with varying structured reasoning trace lengths enables the fine-tuned model to self-calibrate its reasoning trajectory lengths based on the downstream task, without explicit supervision. We present key insights, describe the data curation strategy, and outline next steps toward developing robust medical vision-language reasoning systems.

Paper Structure

This paper contains 25 sections, 2 equations, 24 figures, 2 tables.

Figures (24)

  • Figure 1: Left: Average performance on 3 task types when finetuning a student model with various SFT datasets. All student models were initialized with the same Qwen2.5-VL-7B-Instruct checkpoint and compared to the student's performance before finetuning (dotted line). Right: Progress on MedQA performance over time. Despite its modest 7B parameter size, OctoMed outperforms strong open small-scale and large proprietary systems.
  • Figure 2: Overview of the SFT dataset. Left: Distribution of imaging modalities and anatomical regions represented in the SFT mixture. For large datasets in our mixture lacking modality and region annotations (e.g., PMC-VQA), we obtained this metadata by prompting GPT-4.1-mini. The percentages do not total 100% due to a minor fraction of samples from other less common modalities. Middle: Breakdown of task types and source datasets used for distillation. Right: Summary of key dataset statistics.
  • Figure 3: Average performance improvement across downstream task types when training on different question sources. Models perform best when trained on data that matches the downstream task type. Combining sources yields higher and more consistent improvements, suggesting that diverse data sources provide complementary knowledge that enhances generalization.
  • Figure 4: Effect of question filtering on PMC-VQA performance. All filtering strategies improve sample efficiency compared to the no-filtering baseline but have similar peak performance.
  • Figure 5: Effect of scaling rejection samples and training epochs on MedQA test set performance. Early improvements from additional rejection samples mirror the gains from training for more epochs. However, increasing the number of rejection samples per question consistently raises peak performance, with 16 samples achieving the highest final accuracy.
  • ...and 19 more figures
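
The rejection sampling mentioned in the TL;DR, and varied in Figure 5 as the number of samples per question, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `rejection_sample`, `stub_teacher`, and `extract_final_answer` are hypothetical names, the teacher is a deterministic stand-in rather than a real model, and the "Final answer:" marker and exact-match acceptance rule are assumed for the example.

```python
from typing import Callable, Dict, List


def rejection_sample(
    questions: List[Dict[str, str]],
    teacher: Callable[[str, int], str],
    extract_answer: Callable[[str], str],
    num_samples: int = 16,
) -> List[Dict[str, str]]:
    """Draw num_samples teacher traces per question and keep only those
    whose extracted final answer matches the gold answer."""
    kept = []
    for q in questions:
        gold = q["answer"].strip().lower()
        for sample_index in range(num_samples):
            trace = teacher(q["question"], sample_index)
            if extract_answer(trace).strip().lower() == gold:
                kept.append({"question": q["question"], "trace": trace})
    return kept


def stub_teacher(question: str, sample_index: int) -> str:
    # Hypothetical stand-in for a teacher model: even-numbered samples
    # answer "B", odd-numbered samples answer "C".
    answer = "B" if sample_index % 2 == 0 else "C"
    return f"Reasoning about: {question}\nFinal answer: {answer}"


def extract_final_answer(trace: str) -> str:
    # Assumes the trace ends with a "Final answer: <choice>" line.
    return trace.rsplit("Final answer:", 1)[-1]


# Toy input; the question text and gold label are placeholders.
questions = [{"question": "Toy multiple-choice question", "answer": "B"}]
sft_examples = rejection_sample(
    questions, stub_teacher, extract_final_answer, num_samples=4
)
```

With the stub teacher, half of the sampled traces are accepted. In a real setup, raising `num_samples` gives more accepted traces per question, which is the quantity Figure 5 varies.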