Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches
Clément Christophe, Praveen K Kanithi, Prateek Munjal, Tathagata Raha, Nasir Hayat, Ronnie Rajan, Ahmed Al-Mahrooqi, Avani Gupta, Muhammad Umar Salman, Gurpreet Gosal, Bhargav Kanakiya, Charles Chen, Natalia Vassilieva, Boulbaba Ben Amor, Marco AF Pimentel, Shadab Khan
TL;DR
The paper examines how best to fine-tune medical LLMs by comparing full-parameter fine-tuning with parameter-efficient approaches (notably LoRA) on Med42, a Llama-2–based model family. It builds Med42 from the 7B and 70B Llama-2 models, trains on a diverse medical instruction-tuning dataset, and evaluates on MedQA, HeadQA, MedMCQA, PubMedQA, MMLU clinical topics, and USMLE materials, using a decontamination pipeline to guard against overlap between training and evaluation data. Full-parameter fine-tuning generally outperforms LoRA across most medical benchmarks, with the fully fine-tuned 70B model averaging about 72% accuracy on the USMLE datasets, while LoRA remains competitive when computational resources are limited. The study releases Med42 openly, details its data pipeline and evaluation protocol, and offers practical guidance for deploying medical LLMs with attention to accuracy, reproducibility, and ethical considerations.
Abstract
This study presents a comprehensive analysis and comparison of two predominant fine-tuning methodologies, full-parameter fine-tuning and parameter-efficient tuning, within the context of medical Large Language Models (LLMs). We developed and refined a series of LLMs, based on the Llama-2 architecture, specifically designed to enhance medical knowledge retrieval, reasoning, and question-answering capabilities. Our experiments systematically evaluate the effectiveness of these tuning strategies across various well-known medical benchmarks. Notably, our medical LLM Med42 achieved 72% accuracy on the US Medical Licensing Examination (USMLE) datasets, setting a new standard in performance for openly available medical LLMs. Through this comparative analysis, we aim to identify the most effective and efficient method for fine-tuning LLMs in the medical domain, thereby contributing to the advancement of AI-driven healthcare applications.
