UltraMedical: Building Specialized Generalists in Biomedicine

Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, Haoxin Li, Ganqu Cui, Biqing Qi, Xuekai Zhu, Xingtai Lv, Hu Jinfang, Zhiyuan Liu, Bowen Zhou

TL;DR

The paper tackles the gap between open-source and proprietary biomedical LLMs by adopting a data-centric approach to build specialized generalists. It introduces UltraMedical, a large, diverse dataset of ~410K medical instructions with ~100K preference-annotated samples and ~1.8M preference pairs, plus a suite of training and alignment techniques (SFT, DPO/KTO-based preference learning, and reward modeling) applied to Llama-3 8B/70B. Through iterative preference learning and reward modeling, the UltraMedical models achieve strong performance on medical benchmarks, including a score of 86.5 on MedQA-USMLE for the 70B model, and narrow the gap with proprietary systems while maintaining general-domain capabilities. The datasets and models are publicly available, offering a practical path toward robust open biomedical LLMs and highlighting future directions in reward modeling, online refinement, and bias mitigation.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains and are moving towards more specialized areas. Recent advanced proprietary models such as GPT-4 and Gemini have achieved significant advancements in biomedicine, which have also raised privacy and security challenges. The construction of specialized generalists hinges largely on high-quality datasets, enhanced by techniques like supervised fine-tuning, reinforcement learning from human or AI feedback, and direct preference optimization. However, these leading technologies (e.g., preference learning) are still significantly limited in the open-source community due to the scarcity of specialized data. In this paper, we present the UltraMedical collections, which consist of high-quality manual and synthetic datasets in the biomedicine domain, featuring preference annotations across multiple advanced LLMs. By utilizing these datasets, we fine-tune a suite of specialized medical models based on the Llama-3 series, demonstrating breathtaking capabilities across various medical benchmarks. Moreover, we develop powerful reward models skilled in biomedical and general reward benchmarks, enhancing further online preference learning within the biomedical LLM community. Datasets and models are available at https://github.com/TsinghuaC3I/UltraMedical
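
The abstract names direct preference optimization (DPO) as one of the alignment techniques. As a reminder of what that objective computes, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not the authors' training code: the function name, the argument names, and the default beta = 0.1 are illustrative assumptions, and the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All names and the default beta are hypothetical, chosen for illustration.
    """
    # Implicit reward of each response: how much more likely the policy
    # makes it compared with the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a numerically stable way for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy prefers the chosen response more than the reference does,
    # so the loss falls below log(2), the value at zero margin.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss shrinks as the policy raises the chosen response relative to the rejected one.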

Paper Structure

This paper contains 37 sections, 13 figures, 7 tables.

Figures (13)

  • Figure 1: The UltraMedical Datasets, Models and Performance on MedQA.
  • Figure 1: Instructions Statistics. Datasets marked with "" are our customized synthetic data, while the others are adapted from publicly available data. Avg.Len and Avg.Score denote the average length and the average score assigned by ChatGPT.
  • Figure 2: The Construction Pipeline for the UltraMedical Dataset.
  • Figure 3: Broad Topics Distribution
  • Figure 4: Process of Online Preference Learning.
  • ...and 8 more figures