ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Deli Zhao, Wenbing Huang, Tingyang Xu, Qifeng Bai, Yu Rong
TL;DR
ReasonMed tackles the data scarcity and validation challenges in medical reasoning by introducing a large-scale, multi-agent generated dataset of 370k high-quality exemplars distilled from 1.291M validated reasoning trajectories. The authors design a multi-agent system (MAS) that generates diverse multi-step chain-of-thought (CoT) paths and validates and refines them through a structured easy-medium-difficult (EMD) pipeline. They also explore training regimes that fuse detailed CoT with concise answer summaries. Empirical results show that ReasonMed-7B achieves state-of-the-art performance among sub-10B models on PubMedQA and strong performance across multiple medical QA benchmarks, with further gains when scaling to 14B. The work provides a transferable, data-centric blueprint for constructing domain-specific reasoning datasets and demonstrates the value of explicit reasoning in medical QA, potentially narrowing the gap to much larger models while reducing training costs.
Abstract
Reasoning-based large language models have excelled in mathematics and programming, yet their potential in knowledge-intensive medical question answering remains underexplored and insufficiently validated in clinical contexts. To bridge this gap, we introduce ReasonMed, the largest medical reasoning dataset to date, comprising 370k high-quality examples distilled from 1.75 million initial reasoning paths generated by complementary LLMs and curated through a cost-efficient easy-medium-difficult (EMD) pipeline. ReasonMed is built through a multi-agent generation, verification, and refinement process, in which an Error Refiner improves reasoning paths by correcting error-prone steps identified by a verifier. Using ReasonMed, we investigate effective strategies for training medical reasoning models and find that integrating detailed chain-of-thought (CoT) reasoning with concise answer summaries yields the most robust fine-tuning results. Models trained on ReasonMed set a new benchmark: ReasonMed-7B surpasses the prior best sub-10B models by 4.17% and even exceeds LLaMA3.1-70B on PubMedQA by 4.60%. The scaled-up ReasonMed-14B remains highly competitive, underscoring consistent scaling potential. The code and datasets are available at https://github.com/YuSun-Work/ReasonMed.
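
The abstract does not spell out how the EMD pipeline assigns questions to tiers or how CoT and summaries are combined in a training target, so the following is only a minimal sketch under stated assumptions. It assumes each question has several candidate reasoning paths, each carrying a boolean verdict from a verifier, and that the number of verified-correct paths decides the tier. The names `ReasoningPath`, `route_by_difficulty`, and `fuse_cot_and_summary`, the thresholds `easy_min` and `medium_min`, and the `Summary:` separator are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReasoningPath:
    """One candidate chain-of-thought for a question (hypothetical structure)."""
    text: str
    verified_correct: bool  # assumed: the verifier's verdict reduced to a boolean


def route_by_difficulty(
    paths: List[ReasoningPath],
    easy_min: int = 5,    # hypothetical threshold, not from the paper
    medium_min: int = 2,  # hypothetical threshold, not from the paper
) -> str:
    """Assign a question to a tier from its count of verified-correct paths."""
    n_correct = sum(p.verified_correct for p in paths)
    if n_correct >= easy_min:
        return "easy"       # enough good paths: select among them directly
    if n_correct >= medium_min:
        return "medium"     # some good paths: candidates for error refinement
    return "difficult"      # few or none: would need a stronger regeneration step


def fuse_cot_and_summary(cot: str, summary: str) -> str:
    """Build one training target: detailed reasoning, then a concise answer summary."""
    return f"{cot.strip()}\n\nSummary: {summary.strip()}"


if __name__ == "__main__":
    candidates = [ReasoningPath("path", True)] * 3 + [ReasoningPath("path", False)] * 6
    print(route_by_difficulty(candidates))
    print(fuse_cot_and_summary("Step 1: ... Step 2: ...", "The answer is B."))
```

Routing on the count of verified paths is one simple way to spend expensive refinement only where cheaper selection is insufficient, which is the cost-efficiency idea the abstract attributes to the EMD pipeline. The actual tier criteria should be taken from the paper or the released code.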
