RAD-PHI2: Instruction Tuning PHI-2 for Radiology
Mercy Ranjit, Gopinath Ganapathy, Shaury Srivastav, Tanuja Ganu, Srujana Oruganti
TL;DR
The paper tackles the challenge of applying small language models to radiology. Fine-tuning Phi-2 on high-quality Radiopaedia content yields Rad-Phi2-Base, and instruction tuning yields Rad-Phi2. The instruction-tuning pipeline has two stages (general-domain tuning first, then radiology-specific tuning) and is supported by a Radiopaedia-based QA dataset and MIMIC-CXR-derived report tasks that enable multi-task radiology capabilities. Across lexical, clinical, and GPT-4-based evaluations, Rad-Phi2-Base and Rad-Phi2 are competitive with or outperform larger models such as Mistral-7B-Instruct-v0.2 and GPT-4 on radiology knowledge queries and reporting tasks, while requiring less compute. The work demonstrates that domain-specific small language models are feasible and practical in radiology workflows, and that high-quality, curated radiology data can support accessible AI tools for both knowledge retrieval and structured report generation.
Abstract
Small Language Models (SLMs) have shown remarkable performance in general-domain language understanding, reasoning, and coding tasks, but their capabilities in the medical domain, particularly on radiology text, are less explored. In this study, we investigate the application of SLMs to general radiology knowledge, specifically question answering about symptoms, radiological appearances of findings, differential diagnosis, prognosis assessment, and treatment suggestions for diseases of different organ systems. Additionally, we explore the utility of SLMs for text-related tasks on radiology reports within AI-driven radiology workflows. We fine-tune Phi-2, an SLM with 2.7 billion parameters, using high-quality educational content from Radiopaedia, a collaborative online radiology resource. The resulting language model, Rad-Phi2-Base, is able to address general radiology queries across various systems (e.g., chest, cardiac). Furthermore, we investigate instruction tuning of Phi-2 to enable it to perform specific tasks. By fine-tuning Phi-2 on both general-domain tasks and radiology-specific tasks related to chest X-ray reports, we create Rad-Phi2. Our empirical results show that Rad-Phi2-Base and Rad-Phi2 perform comparably to, or even outperform, larger models such as Mistral-7B-Instruct-v0.2 and GPT-4, providing concise and precise answers. In summary, our work demonstrates the feasibility and effectiveness of using SLMs in radiology workflows, both for knowledge-related queries and for specific tasks related to radiology reports, thereby opening up new avenues for enhancing the quality and efficiency of radiology practice.
