Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search

Haoran Sun, Yankai Jiang, Wenjie Lou, Yujie Zhang, Wenjie Li, Lilong Wang, Mianxin Liu, Lei Liu, Xiaosong Wang

TL;DR

This work tackles generalizable multimodal medical reasoning in MLLMs by introducing Mentor-Intern Collaborative Search (MICS), a scheme for generating high-quality chain-of-thought (CoT) data. It constructs the MMRP dataset, which combines text QA, image–text alignment, and MICS-generated multimodal CoT data across multiple medical modalities, and then trains Chiron-o1 with a three-stage curriculum so that reasoning capabilities emerge. Empirical results show Chiron-o1 achieving state-of-the-art performance on medical VQA and reasoning benchmarks, with strong out-of-domain generalization and high-quality reasoning paths as measured by the MICS-Score. The approach improves interpretability and diagnostic reasoning, and highlights both the potential and the challenges of scalable CoT data construction for clinical AI applications.

Abstract

Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. Constructing chain-of-thought (CoT) training data is essential for bolstering the reasoning abilities of medical MLLMs. However, existing approaches lack a comprehensive framework for searching and evaluating effective reasoning paths towards critical diagnosis. To address this challenge, we propose Mentor-Intern Collaborative Search (MICS), a novel reasoning-path searching scheme to generate rigorous and effective medical CoT data. MICS first leverages mentor models to initialize the reasoning, one step at a time, then prompts each intern model to continue the thinking along those initiated paths, and finally selects the optimal reasoning path according to the overall reasoning performance of multiple intern models. The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths. Finally, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM trained via a curriculum learning strategy, with robust visual question-answering and generalizable reasoning capabilities. Extensive experiments demonstrate that Chiron-o1, trained on our CoT dataset constructed using MICS, achieves state-of-the-art performance across a range of medical visual question answering and reasoning benchmarks. Code is available at https://github.com/manglu097/Chiron-o1
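The search loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `mics_search` and `path_score`, the callable signatures for mentors and interns, the scoring rule (fraction of interns that reach the reference answer from a given prefix), and the early-stopping condition (all interns succeed) are all hypothetical stand-ins, since the abstract does not define the MICS-Score or the stopping rule in detail. The toy mentors and interns exist only to make the sketch runnable.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces (not from the paper):
# a mentor proposes the next reasoning step given the steps so far;
# an intern continues from a reasoning prefix and returns a final answer.
Mentor = Callable[[str, List[str]], str]
Intern = Callable[[str, List[str]], str]


def path_score(question: str, prefix: List[str],
               interns: Sequence[Intern], answer: str) -> float:
    """Assumed stand-in for the MICS-Score: the fraction of interns
    that reach the reference answer when continuing from `prefix`."""
    hits = sum(1 for intern in interns if intern(question, prefix) == answer)
    return hits / len(interns)


def mics_search(question: str, answer: str, mentors: Sequence[Mentor],
                interns: Sequence[Intern], max_depth: int = 3) -> List[str]:
    """Grow a reasoning path one step at a time, keeping at each depth the
    mentor-proposed step whose prefix the interns complete best."""
    path: List[str] = []
    for _ in range(max_depth):
        # Each mentor proposes one candidate next step.
        candidates = [path + [mentor(question, path)] for mentor in mentors]
        # Interns evaluate every candidate prefix.
        scores = [path_score(question, c, interns, answer) for c in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        path = candidates[best]
        if scores[best] == 1.0:  # assumed early-stopping condition
            break
    return path


# Toy models, purely illustrative.
def mentor_good(question: str, path: List[str]) -> str:
    return f"consolidation finding {len(path) + 1}"


def mentor_bad(question: str, path: List[str]) -> str:
    return "irrelevant remark"


def intern_a(question: str, prefix: List[str]) -> str:
    # Succeeds once any step mentions the key finding.
    return "pneumonia" if any("consolidation" in s for s in prefix) else "normal"


def intern_b(question: str, prefix: List[str]) -> str:
    # Needs two supporting steps before it reaches the right answer.
    n = sum("consolidation" in s for s in prefix)
    return "pneumonia" if n >= 2 else "normal"


if __name__ == "__main__":
    result = mics_search("q", "pneumonia", [mentor_bad, mentor_good],
                         [intern_a, intern_b], max_depth=5)
    print(result)
```

In this toy setup the search keeps the informative mentor's step at each depth and stops once both interns reach the reference answer. A real pipeline would replace the callables with model calls and the scoring rule with the paper's MICS-Score.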

Paper Structure

This paper contains 33 sections, 9 equations, 25 figures, 5 tables.

Figures (25)

  • Figure 1: Overview of the MMRP Dataset and Chiron-o1 Performance. (a) The MMRP dataset encompasses 12 imaging modalities and 20 body systems. (b) Chiron-o1 achieves SOTA performance across various benchmarks compared to existing multimodal medical models.
  • Figure 2: Framework of the MICS Strategy. MICS enables search for effective reasoning paths through collaboration between mentor and intern models until the maximum search depth is reached or early-stopping conditions are met. $\theta$ denotes the mentor model, and $\beta$ denotes the intern model. An example of CoT construction using MICS is provided in Figure \ref{figure_app2}.
  • Figure 3: Case Study on the MMRP Test Set. Compared to other multimodal medical reasoning models, Chiron-o1-8B demonstrates the ability to generate deep and reasonable reasoning paths, leading to correct answers. Due to page limitations, details are provided in the appendix section "Qualitative Analysis of Medical Reasoning Models".
  • Figure 4: Ablation Studies on MICS. Contribution of the MICS strategy to reasoning path score trends, with a, b, and c denoting three clinical scenarios (see the appendix section "Data Collection"). "vanilla" refers to directly generating reasoning paths using the mentor model without evaluation.
  • Figure 5: Ablation Studies on the Model Training Strategy. Panels (a) and (b) present results for Chiron-o1-8B and Chiron-o1-2B, respectively. The comparison highlights the advantage of the proposed stage-wise curriculum over alternative training schemes.
  • ...and 20 more figures