Enhancing disease detection in radiology reports through fine-tuning lightweight LLM on weak labels
Yishu Wei, Xindi Wang, Hanley Ong, Yiliang Zhou, Adam Flanders, George Shih, Yifan Peng
TL;DR
This work addresses practical barriers to deploying large LLMs in radiology by fine-tuning a lightweight model (Llama 3.1-8B) on synthetic, weak labels across two tasks: multiple-choice (MC) disease classification and open-ended disease detection mapped to ICD-10 concepts. The authors implement a joint instruction-tuning framework using LoRA, with carefully designed prompts and synthetic label sources (NegBio for MC; GPT-4o for open-ended), and train on data from NIH-CXR/MIRDC, WCM, and MIMIC-CXR. Results show that high-quality synthetic labels enable the open-ended task to reach a micro-F1 near GPT-4o levels ($0.91$ vs $0.93$) and that MC classification improves over its noisy teacher ($0.67$ vs $0.63$), with the best gains at around $9{,}000$ fine-tuning samples and a learning rate near $1\times 10^{-5}$. Overall, the study demonstrates the potential of synthetic-label fine-tuning to specialize small LLMs for radiology, suggesting a scalable path to cohort-specific disease detection with practical computational demands.
Abstract
Despite significant progress in applying large language models (LLMs) to the medical domain, several limitations still hinder their practical use. Among these are constraints on model size and the lack of cohort-specific labeled datasets. In this work, we investigated the potential of improving a lightweight LLM, such as Llama 3.1-8B, by fine-tuning it on datasets with synthetic labels. Two tasks are jointly trained by combining their respective instruction datasets. When the quality of the task-specific synthetic labels is relatively high (e.g., generated by GPT-4o), Llama 3.1-8B achieves satisfactory performance on the open-ended disease detection task, with a micro F1 score of 0.91. Conversely, when the quality of the task-relevant synthetic labels is relatively low (e.g., from the MIMIC-CXR dataset), the fine-tuned Llama 3.1-8B is able to surpass its noisy teacher labels (micro F1 score of 0.67 vs. 0.63) when calibrated against curated labels, indicating the strong inherent capability of the model. These findings demonstrate the potential of fine-tuning LLMs with synthetic labels, offering a promising direction for future research on LLM specialization in the medical domain.
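Because the results above are reported as micro F1 scores, a minimal sketch of that metric for multi-label disease detection may help. It pools true positives, false positives, and false negatives over all reports before computing F1. The function name `micro_f1` and the example disease labels are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-report label sets.

    Counts are pooled across all reports, so frequent diseases
    weigh more than rare ones. Hypothetical helper for illustration.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of reports")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in the reference
        fp += len(p - g)   # labels predicted but absent from the reference
        fn += len(g - p)   # reference labels the model missed

    denom = 2 * tp + fp + fn
    # Convention chosen here: return 0.0 when there are no labels at all.
    return 2 * tp / denom if denom else 0.0


# Made-up labels for three reports (not data from the study).
gold_labels = [{"pneumonia", "effusion"}, {"cardiomegaly"}, set()]
pred_labels = [{"pneumonia"}, {"cardiomegaly", "edema"}, set()]
score = micro_f1(gold_labels, pred_labels)
print(f"micro F1 = {score:.3f}")
```

In this toy case the pooled counts are 2 true positives, 1 false positive, and 1 false negative, so the score is $2 \cdot 2 / (2 \cdot 2 + 1 + 1) = 2/3$.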
