MGH Radiology Llama: A Llama 3 70B Model for Radiology
Yucheng Shi, Peng Shu, Zhengliang Liu, Zihao Wu, Quanzheng Li, Tianming Liu, Ninghao Liu, Xiang Li
TL;DR
The paper addresses the challenge of generating accurate radiology impressions from radiology findings by introducing MGH Radiology Llama, a radiology-focused model built on Llama 3 70B and trained on over 6.5 million de-identified reports from Massachusetts General Hospital. Training uses instruction tuning with both full fine-tuning and QLoRA, relying on 4-bit quantization and DeepSpeed ZeRO across eight NVIDIA H100 GPUs to scale efficiently. Evaluation combines traditional metrics (ROUGE-L, BERTScore) with a GPT-4o-based assessment and shows substantial improvements over baseline and prior domain-specific models in generating concise, clinically relevant impressions. The work highlights the potential of large, domain-specific LLMs for radiology reporting while acknowledging limitations such as hallucination and the need for continued data curation and better evaluation before safe clinical deployment.
Abstract
In recent years, the field of radiology has increasingly harnessed the power of artificial intelligence (AI) to enhance diagnostic accuracy, streamline workflows, and improve patient care. Large language models (LLMs) have emerged as particularly promising tools, offering significant potential in assisting radiologists with report generation, clinical decision support, and patient communication. This paper presents an advanced radiology-focused large language model: MGH Radiology Llama. It is developed using the Llama 3 70B model, building upon previous domain-specific models like Radiology-GPT and Radiology-Llama2. Leveraging a unique and comprehensive dataset from Massachusetts General Hospital, comprising over 6.5 million de-identified medical reports across various imaging modalities, the model demonstrates significant improvements in generating accurate and clinically relevant radiology impressions given the corresponding findings. Our evaluation, incorporating both traditional metrics and a GPT-4-based assessment, highlights the enhanced performance of this work over general-purpose LLMs.
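To make the "traditional metrics" concrete, the following is a minimal sketch of ROUGE-L, which scores a generated impression by its longest common subsequence (LCS) with the reference impression. It assumes lowercased whitespace tokenization and an F-measure with beta = 1; the function names and the example sentences are hypothetical, and the paper's actual evaluation tooling and settings may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """ROUGE-L F-measure between a reference and a candidate string.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Hypothetical impressions, for illustration only.
reference = "No acute cardiopulmonary abnormality"
candidate = "No acute abnormality"
score = rouge_l(reference, candidate)
```

Because ROUGE-L rewards only in-order word overlap, a clinically equivalent paraphrase can score low, which is one reason to pair it with embedding-based and model-based assessments.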
