ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Deli Zhao, Wenbing Huang, Tingyang Xu, Qifeng Bai, Yu Rong

TL;DR

ReasonMed addresses data scarcity and validation challenges in medical reasoning with a large-scale, multi-agent generated dataset of 370k high-quality examples distilled from 1.291M validated reasoning trajectories. The authors design a multi-agent system (MAS) that generates diverse multi-step chain-of-thought (CoT) paths, then validates and refines them through a structured easy-medium-difficult (EMD) pipeline. They also explore training regimes that fuse detailed CoT with concise answer summaries. Empirical results show that ReasonMed-7B achieves state-of-the-art performance among sub-10B models on PubMedQA and strong performance across multiple medical QA benchmarks, with further gains when scaling to 14B. The work provides a transferable, data-centric blueprint for constructing domain-specific reasoning datasets and demonstrates the value of explicit reasoning in medical QA, potentially narrowing the gap to much larger models while reducing training costs.

Abstract

Reasoning-based large language models have excelled in mathematics and programming, yet their potential in knowledge-intensive medical question answering remains underexplored and insufficiently validated in clinical contexts. To bridge this gap, we introduce ReasonMed, the largest medical reasoning dataset to date, comprising 370k high-quality examples distilled from 1.75 million initial reasoning paths generated by complementary LLMs and curated through a cost-efficient easy-medium-difficult (EMD) pipeline. ReasonMed is built through a multi-agent generation, verification, and refinement process, in which an Error Refiner improves reasoning paths by correcting error-prone steps identified by a verifier. Using ReasonMed, we investigate effective strategies for training medical reasoning models and find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results. Models trained on ReasonMed set a new benchmark: ReasonMed-7B surpasses the prior best sub-10B models by 4.17% and even exceeds LLaMA3.1-70B on PubMedQA by 4.60%. When scaled to ReasonMed-14B, it remains highly competitive, underscoring consistent scaling potential. The code and datasets are available at https://github.com/YuSun-Work/ReasonMed.

Paper Structure

This paper contains 52 sections, 6 figures, 12 tables.

Figures (6)

  • Figure 1: (1) shows the composition of the dataset. (2) presents the Multi-Agent System for generating and validating Complex CoT. (3) outlines the strategy schemes (Easy/Medium/Difficult Pipeline) based on CoT validation counts. For 0-4 errors, the top two CoTs are selected using the Quality Ranker. For 5-7 errors, the top two CoTs are optimized with GPT-4o-mini, addressing identified weak points. For 8-9 errors, high-quality answers are generated using GPT-o1.
  • Figure 2: Knowledge domain differences among DeepSeek-R1-Distill-Llama-70B, HuatuoGPT-o1-70B and Qwen2.5-72B.
  • Figure 3: (1) Shows an example of SFT applied at different scales. (2) to (6) represent the components used to build the entire pipeline for our dataset.
  • Figure 4: Bar chart illustrating the correct and incorrect counts for each model and CoT configuration across 9 generated paths in a Multi-Agent System, totaling 192,628.
  • Figure 5: Distribution of the top two CoT paths selected by the Quality Ranker in Easy Pipeline and Medium Pipeline, showing sampling proportions across models and temperature settings.
  • ...and 1 more figure
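
The Easy/Medium/Difficult routing described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the error-count thresholds (0-4, 5-7, 8-9) and the nine paths per question come from the captions above, while `CoTPath`, its `is_correct` and `quality` fields, `route_pipeline`, `top_two`, and `process_question` are hypothetical names. The sketch also assumes the ranker scores all nine paths, which the captions do not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_PATHS = 9  # paths generated per question, per the Figure 4 caption


@dataclass
class CoTPath:
    """Hypothetical container for one generated reasoning path."""
    text: str
    is_correct: bool  # verdict from a verifier (assumed field)
    quality: float    # score from a quality ranker (assumed field)


def route_pipeline(num_errors: int) -> str:
    """Map the number of failed paths to a pipeline name (Figure 1 thresholds)."""
    if not 0 <= num_errors <= NUM_PATHS:
        raise ValueError(f"num_errors must be in [0, {NUM_PATHS}]")
    if num_errors <= 4:
        return "easy"
    if num_errors <= 7:
        return "medium"
    return "difficult"


def top_two(paths: List[CoTPath]) -> List[CoTPath]:
    """Return the two highest-scoring paths (assumption: rank all paths)."""
    return sorted(paths, key=lambda p: p.quality, reverse=True)[:2]


def process_question(paths: List[CoTPath]) -> Tuple[str, List[CoTPath]]:
    """Route one question and return the paths handed to the next stage."""
    num_errors = sum(not p.is_correct for p in paths)
    pipeline = route_pipeline(num_errors)
    if pipeline == "difficult":
        # 8-9 errors: the answer is regenerated by a stronger model,
        # so no existing path is carried forward in this sketch.
        return pipeline, []
    # Easy: the top two are kept as they are.
    # Medium: the top two would then be passed to a refiner model.
    return pipeline, top_two(paths)
```

For example, a question whose nine paths include six failures is routed to the medium pipeline, and its two best-scoring paths are the ones that would be sent on for refinement.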