Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks

Hyunjae Kim; Hyeon Hwang; Jiwoo Lee; Sihyeon Park; Dain Kim; Taewhoo Lee; Chanwoong Yoon; Jiwoong Sohn; Donghee Choi; Jaewoo Kang

TL;DR

Meerkat is a family of open-source medical LMs (7B–70B) trained on a large-scale synthetic chain-of-thought dataset derived from 18 medical textbooks and MedQA USMLE-style questions, together with diverse instruction-following datasets. The models outperform previous open-source medical models and GPT-3.5 across six medical benchmarks. The 7B model surpasses the USMLE passing threshold, while the 70B model outperforms GPT-4 by 1.3% on average and closely matches GPT-4 on complex clinical cases (21 versus 21.8 correct diagnoses out of 38), with more detailed free-form clinical responses than prior small models. An ablation study shows that chain-of-thought fine-tuning and textbook-based data augmentation both improve accuracy, and that general-purpose backbones (e.g., Mistral-7B) outperform biomedical-specialized ones. The work emphasizes open, on-premise deployment for privacy, introduces the MedBooks-CoT-18 dataset, and notes that factuality and safety need further improvement before real-world deployment.

Abstract

While recent advancements in commercial large language models (LMs) have shown promising results in medical tasks, their closed-source nature poses significant privacy and security concerns, hindering their widespread use in the medical field. Despite efforts to create open-source models, their limited parameter counts often result in insufficient multi-step reasoning capabilities required for solving complex medical problems. To address this, we introduce Meerkat, a new family of medical AI systems ranging from 7 to 70 billion parameters. The models were trained using our new synthetic dataset consisting of high-quality chain-of-thought reasoning paths sourced from 18 medical textbooks, along with diverse instruction-following datasets. Our systems achieved remarkable accuracy across six medical benchmarks, surpassing the previous best models such as MediTron and BioMistral, and GPT-3.5 by a large margin. Notably, Meerkat-7B surpassed the passing threshold of the United States Medical Licensing Examination (USMLE) for the first time for a 7B-parameter model, while Meerkat-70B outperformed GPT-4 by an average of 1.3%. Additionally, Meerkat-70B correctly diagnosed 21 out of 38 complex clinical cases, outperforming humans' 13.8 and closely matching GPT-4's 21.8. Our systems offered more detailed free-form responses to clinical queries compared to existing small models, approaching the performance level of large commercial models. These results significantly narrow the performance gap with large LMs, showcasing the effectiveness of our systems in addressing complex medical challenges.

Paper Structure (24 sections, 1 equation, 7 figures, 11 tables)

Introduction
Results
Main Results
Multiple-Choice QA
NEJM Case Challenges
Real-world Clinical Questions
Ablation Study
Effect of CoT Fine-tuning
Effect of Textbook Augmentation
Language Model Selection
Assessment of Model Explanations
Discussion
Methods
Training
Chain-of-thought Reasoning Data Generation
...and 9 more sections

Figures (7)

Figure 1: Overview of recent advances in language models (LMs) based on their performance on the MedQA benchmark jin2021disease. Large closed-source models have surpassed the USMLE passing threshold, reaching state-of-the-art performance of 91.1% accuracy saab2024capabilities. On the other hand, the previous best open-source model, MediTron-70B chen2023meditron, has achieved a score of only 70.2%, while no 7B-scale models have managed to surpass the USMLE passing threshold (60%). Our new open-source model, Meerkat-7B, has achieved an accuracy of 77.1%, demonstrating notable progress in open-source model development in the medical domain. Additionally, our new 8B and 70B models have further pushed the state-of-the-art performance for open-source medical AI systems.
Figure 2: Performance of models on six multiple-choice QA benchmark datasets and NEJM case challenges. Our Meerkat models generally performed better than existing 7B models and GPT-3.5 across the six datasets and outperformed MediTron-70B on MedQA. Additionally, our 70B model exceeded GPT-4 in performance. The scores of GPT-3.5, GPT-4 and MediTron-70B are obtained from the papers of nori2023capabilities, toma2023clinical, chen2023meditron, and chen2024benchmarking. In subfigure (b), MediTron-70B (SC) denotes that the self-consistency CoT prompting method was employed during the model's evaluation wang2023self, whereas Ensemble 5-20 refers to the number of runs of choice shuffling ensemble nori2023capabilities. Detailed scores for the six benchmarks shown in subfigure (a) are provided in the six-benchmark table (tab:six_benchmarks).
Figure 3: Performance comparison of six language models trained with three different datasets on the MedQA benchmark. Mistral-7B, Gemma-7B google2024gemma, and LLaMA-3-8B performed better than MediTron-7B and BioMistral-7B, despite not being specialized models for biomedicine. "MedQA": training the model only using question-answer pairs in the MedQA training set. "MedQA-CoT": training the model using MedQA question-answer pairs and corresponding CoT reasoning data. "MedQA-CoT + MedBooks-CoT-18": training the model using the MedQA-CoT data and additional CoT data generated from textbooks.
Figure 4: Evaluation of model explanations for USMLE-style questions. The scores were measured by comparison with human explanations. Meerkat-7B performed the best according to ROUGE-L lin2004rouge and BERTScore zhang2019bertscore, and it ranked second in terms of the GPT-4 score. "O" denotes explanations for questions that Meerkat-7B answered correctly, while "X" indicates explanations for questions the model answered incorrectly.
Figure 5: The overall process of generating synthetic chain-of-thought (CoT) data. (1) GPT-4 was prompted to provide answers, along with step-by-step explanations, for USMLE-style questions from MedQA jin2021disease, resulting in 9.3K CoT examples. (2) GPT-4 received three randomly sampled questions from MedQA and text chunks from medical textbooks as input to produce synthetic question-answer pairs. (3) GPT-4 was then prompted to generate step-by-step explanations for these generated questions, resulting in an additional 78K CoT examples.
...and 2 more figures
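
The three-step data generation process described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `llm` callable stands in for GPT-4, and the prompt wording, the `CoTExample` type, the `build_cot_dataset` function, and the `stub_llm` placeholder are all hypothetical names introduced here. Only the overall flow (explain MedQA questions, generate new questions from three sampled MedQA questions plus a textbook chunk, then explain the generated questions) comes from the caption.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CoTExample:
    question: str
    explanation: str
    source: str  # "medqa" or "textbook"


def build_cot_dataset(
    medqa_questions: List[str],
    textbook_chunks: List[str],
    llm: Callable[[str], str],  # hypothetical stand-in for a GPT-4 call
    rng: random.Random,
    shots: int = 3,  # the caption mentions three sampled MedQA questions
) -> List[CoTExample]:
    examples: List[CoTExample] = []

    # Step 1: step-by-step explanations for existing MedQA questions.
    for question in medqa_questions:
        explanation = llm(f"Answer step by step:\n{question}")
        examples.append(CoTExample(question, explanation, "medqa"))

    # Steps 2 and 3: synthesize a question from each textbook chunk,
    # then ask for a step-by-step explanation of that new question.
    for chunk in textbook_chunks:
        demos = rng.sample(medqa_questions, k=min(shots, len(medqa_questions)))
        prompt = (
            "Example questions:\n" + "\n".join(demos)
            + f"\n\nTextbook passage:\n{chunk}\n\n"
            + "Write a new question with its answer."
        )
        new_question = llm(prompt)
        explanation = llm(f"Answer step by step:\n{new_question}")
        examples.append(CoTExample(new_question, explanation, "textbook"))

    return examples


def stub_llm(prompt: str) -> str:
    """Placeholder model so the sketch runs without any API access."""
    return f"[model output for a {len(prompt)}-character prompt]"


if __name__ == "__main__":
    demo = build_cot_dataset(
        ["Question A", "Question B", "Question C", "Question D"],
        ["Passage 1", "Passage 2"],
        stub_llm,
        random.Random(0),
    )
    for example in demo:
        print(example.source, "|", example.question)
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; a real pipeline would also need to parse the generated question-answer pairs and handle malformed outputs, which the caption does not describe.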
