Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare

Emre Can Acikgoz; Osman Batur İnce; Rayene Bench; Arda Anıl Boz; İlker Kesen; Aykut Erdem; Erkut Erdem

TL;DR

Hippocrates is an open-source framework for advancing medical LLMs that releases its data, code, and evaluation protocol in full. Its pipeline has four phases: continued pre-training, supervised fine-tuning, medical preference learning, and evaluation. Training uses LoRA-based adaptation on LLaMA2 7B and Mistral 7B bases, with clinician preference data derived via RLAIF (reinforcement learning from AI feedback). The resulting Hippo-7B models outperform existing open medical LLMs and approach or exceed some larger closed models on six clinical benchmarks, evaluated with a standardized protocol (LM-Eval Harness). The work also analyzes the contribution of each training stage, prompting strategies, and uncertainty calibration, and it advocates for reproducibility and broader access to medical AI research resources.
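
As a rough illustration of the LoRA-based adaptation mentioned above, the sketch below shows the standard LoRA update, in which a frozen weight matrix W is augmented by a scaled low-rank product, W + (alpha / r) · B · A. This is a toy, pure-Python rendering of the general technique and not the authors' training code; the function names, the rank, the scaling factor `alpha`, and the matrix sizes are all illustrative assumptions, since the summary does not state which values or layers were used.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A) for a frozen weight matrix W.

    A has shape (r, d_in) and B has shape (d_out, r); only A and B
    would be trained, while W stays fixed.
    """
    r = len(a)  # rank of the low-rank update
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example (all values are made up for illustration).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen base weight, shape (2, 3)
a = [[1.0, 2.0, 3.0]]                   # trainable, shape (r=1, 3)
b_init = [[0.0], [0.0]]                 # B starts at zero, so W is unchanged
b_trained = [[0.5], [0.0]]              # hypothetical values after training

unchanged = lora_effective_weight(w, a, b_init, alpha=2.0)
adapted = lora_effective_weight(w, a, b_trained, alpha=2.0)
print(unchanged)
print(adapted)
```

Because B is initialized to zero, the adapted model starts out identical to the base model, and only the small matrices A and B need to be stored per adapter.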

Abstract

The integration of Large Language Models (LLMs) into healthcare promises to transform medical diagnostics, research, and patient care. Yet, the progression of medical LLMs faces obstacles such as complex training requirements, rigorous evaluation demands, and the dominance of proprietary models that restrict academic exploration. Transparent, comprehensive access to LLM resources is essential for advancing the field, fostering reproducibility, and encouraging innovation in healthcare AI. We present Hippocrates, an open-source LLM framework specifically developed for the medical domain. In stark contrast to previous efforts, it offers unrestricted access to its training datasets, codebase, checkpoints, and evaluation protocols. This open approach is designed to stimulate collaborative research, allowing the community to build upon, refine, and rigorously evaluate medical LLMs within a transparent ecosystem. We also introduce Hippo, a family of 7B models tailored for the medical domain, fine-tuned from Mistral and LLaMA2 through continual pre-training, instruction tuning, and reinforcement learning from human and AI feedback. Our models outperform existing open medical LLMs by a large margin, even surpassing models with 70B parameters. Through Hippocrates, we aspire to unlock the full potential of LLMs not just to advance medical knowledge and patient care but also to democratize the benefits of AI research in healthcare, making them available across the globe.

Paper Structure (59 sections, 7 figures, 14 tables)

Introduction
Hippocrates Framework
Continued Pre-training Data
Supervised Fine-Tuning Data
General Instructions Data.
Evaluation Instructions Data.
Medical Preference Data
Medical RLAIF.
Validation.
Training Methodology
Continued pre-training.
Supervised Finetuning.
Medical Preference Learning.
Main Results
Experimental Setup
...and 44 more sections

Figures (7)

Figure 1: The evolution of medical LLM performance on the MedQA dataset. Our two 7B Hippo models achieve 50.8% and 59.9% 5-shot accuracy. The stronger of the two outperforms all existing open models, including those with 70B parameters.
Figure 2: An overview of the Hippocrates framework, illustrating the four critical phases including (1) continued pre-training, (2) supervised fine-tuning, (3) reinforcement learning from AI-generated feedback, and (4) the comprehensive evaluation pipeline.
Figure 3: Uncertainty quantification for our best-performing 5-shot Hippo model, where we plot the probability distributions assigned by the model to correct and incorrect predictions on the MedMCQA, MedQA, and PubMedQA datasets.
Figure 4: The most and least influential MedQA instruction-tuning samples for a MedQA test sample for the Hippo model. The test sample is more similar to the most influential sample than to the least influential one.
Figure 5: Examples of prompts used in the evaluation of MedMCQA, MedQA, and PubMedQA. Format shows the information order in the prompt.
...and 2 more figures
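
The analysis described in the Figure 3 caption (comparing the probabilities a model assigns when it is right versus when it is wrong) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' analysis code: the function name `split_confidences`, the input layout (one probability list per question plus the index of the gold answer), and the toy numbers are all hypothetical.

```python
from statistics import mean


def split_confidences(option_probs, gold_indices):
    """Group the probability of each predicted option by correctness.

    option_probs: one list of per-option probabilities per question.
    gold_indices: index of the correct option for each question.
    Returns (confidences_when_correct, confidences_when_incorrect).
    """
    correct, incorrect = [], []
    for probs, gold in zip(option_probs, gold_indices):
        # The prediction is the option with the highest probability.
        pred = max(range(len(probs)), key=probs.__getitem__)
        (correct if pred == gold else incorrect).append(probs[pred])
    return correct, incorrect


# Toy multiple-choice data (made-up probabilities, four options each).
option_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.30, 0.40, 0.20, 0.10],
    [0.25, 0.25, 0.45, 0.05],
]
gold_indices = [0, 2, 2]

correct, incorrect = split_confidences(option_probs, gold_indices)
print("mean confidence when correct:", mean(correct))
print("mean confidence when incorrect:", mean(incorrect))
```

A well-calibrated model would tend to show higher confidence in the first group than in the second; plotting the two lists as histograms gives the kind of view the caption describes.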
