Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches

Clément Christophe, Praveen K Kanithi, Prateek Munjal, Tathagata Raha, Nasir Hayat, Ronnie Rajan, Ahmed Al-Mahrooqi, Avani Gupta, Muhammad Umar Salman, Gurpreet Gosal, Bhargav Kanakiya, Charles Chen, Natalia Vassilieva, Boulbaba Ben Amor, Marco AF Pimentel, Shadab Khan

TL;DR

The paper asks how best to fine-tune medical LLMs, comparing full-parameter fine-tuning with parameter-efficient approaches (notably LoRA) on Med42, a Llama-2–based model family. It builds Med42 from the 7B and 70B Llama-2 variants, trains on a diverse medical instruction-tuning dataset, and evaluates on MedQA, HeadQA, MedMCQA, PubMedQA, MMLU clinical topics, and USMLE materials, with a decontamination pipeline to limit overlap between training and evaluation data. Full-parameter fine-tuning generally outperforms LoRA across most medical benchmarks, with the fully fine-tuned 70B model reaching about 72% average accuracy on the USMLE datasets, while LoRA remains competitive when computational resources are limited. The study releases Med42 openly, details its data pipeline and evaluation protocol, and offers practical guidance for deploying medical LLMs with a focus on accuracy, reproducibility, and ethical considerations.

Abstract

This study presents a comprehensive analysis and comparison of two predominant fine-tuning methodologies - full-parameter fine-tuning and parameter-efficient tuning - within the context of medical Large Language Models (LLMs). We developed and refined a series of LLMs, based on the Llama-2 architecture, specifically designed to enhance medical knowledge retrieval, reasoning, and question-answering capabilities. Our experiments systematically evaluate the effectiveness of these tuning strategies across various well-known medical benchmarks. Notably, our medical LLM Med42 showed an accuracy level of 72% on the US Medical Licensing Examination (USMLE) datasets, setting a new standard in performance for openly available medical LLMs. Through this comparative analysis, we aim to identify the most effective and efficient method for fine-tuning LLMs in the medical domain, thereby contributing significantly to the advancement of AI-driven healthcare applications.
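
To make the trade-off between the two tuning strategies concrete, the sketch below counts trainable parameters for a single square projection matrix under full-parameter fine-tuning and under LoRA. It uses the standard LoRA formulation, in which a frozen weight of shape d_out x d_in is augmented by two low-rank factors of shapes d_out x r and r x d_in. The hidden size 4096 matches Llama-2 7B; the rank r = 8 is an illustrative assumption and not a setting reported in this summary, and the function names are hypothetical.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for LoRA: factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


# Hidden size of Llama-2 7B; the rank is an assumed, illustrative value.
hidden = 4096
rank = 8

full = full_params(hidden, hidden)
lora = lora_params(hidden, hidden, rank)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={rank}): {lora:,} trainable parameters")
print(f"LoRA / full: {lora / full:.4%}")
```

Under these assumptions LoRA trains well under 1% of the parameters of that one matrix, which is why it is attractive when compute is limited, even though the paper finds that full-parameter fine-tuning is generally more accurate.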

Paper Structure (21 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Performance of 7-billion (left) and 70-billion (right) parameter models on various medical-related benchmark datasets (in zero-shot setting). Performance results (accuracy) are displayed in % for the base and fine-tuned models.
  • Figure 1: Two examples of contaminated samples from our instruction-tuning (left) and evaluation (right) datasets.
  • Figure 2: Accuracy change after decontamination for both (70b) fine-tuned models (shown in %).