Enhancing disease detection in radiology reports through fine-tuning lightweight LLM on weak labels

Yishu Wei, Xindi Wang, Hanley Ong, Yiliang Zhou, Adam Flanders, George Shih, Yifan Peng

TL;DR

This work addresses the practical barriers to deploying large LLMs in radiology by fine-tuning a lightweight model (Llama 3.1-8B) on synthetic, weak labels across two tasks: multiple-choice (MC) disease classification and open-ended disease detection mapped to ICD-10 concepts. The authors implement a joint instruction-tuning framework using LoRA, with carefully designed prompts and two synthetic label sources (NegBio for MC; GPT-4o for open-ended), and train on data from NIH-CXR/MIDRC, WCM, and MIMIC-CXR. Results show that high-quality synthetic labels let the open-ended task reach a micro-F1 close to that of GPT-4o ($0.91$ vs $0.93$), and that the fine-tuned model outperforms its noisy teacher on MC classification ($0.67$ vs $0.63$), with optimal gains around $9{,}000$ fine-tuning samples and a learning rate near $1\times 10^{-5}$. Overall, the study demonstrates the potential of synthetic-label fine-tuning to specialize small LLMs for radiology, suggesting a scalable path to cohort-specific disease detection at practical computational cost.

Abstract

Despite significant progress in applying large language models (LLMs) to the medical domain, several limitations still prevent their practical application. Among these are constraints on model size and the lack of cohort-specific labeled datasets. In this work, we investigated the potential of improving a lightweight LLM, such as Llama 3.1-8B, through fine-tuning on datasets with synthetic labels. Two tasks are jointly trained by combining their respective instruction datasets. When the quality of the task-specific synthetic labels is relatively high (e.g., generated by GPT-4o), Llama 3.1-8B achieves satisfactory performance on the open-ended disease detection task, with a micro F1 score of 0.91. Conversely, when the quality of the task-relevant synthetic labels is relatively low (e.g., from the MIMIC-CXR dataset), fine-tuned Llama 3.1-8B is able to surpass its noisy teacher labels (micro F1 score of 0.67 vs. 0.63) when calibrated against curated labels, indicating the strong inherent capability of the model. These findings demonstrate the potential of fine-tuning LLMs with synthetic labels, offering a promising direction for future research on LLM specialization in the medical domain.

Paper Structure (24 sections, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Samples of instruction, input, and output for disease detection in radiology reports.
  • Figure 2: Overview of the experimental setup.
  • Figure 3: Performance of Llama 3.1-8B on the multiple-choice disease classification task with different learning rates. The model was trained on 100,000 samples from the MIMIC-CXR dataset and then tested on human-curated data from the same MIMIC-CXR source.
  • Figure 4: Performance of Llama 3.1-8B using micro-precision, micro-recall, and micro-F1 scores. (a) Multiple-choice disease classification. The model was trained on the MIMIC-CXR dataset only and then tested on human-curated data from the MIMIC-CXR dataset. (b) Open-ended disease detection. The model was trained on the WCM dataset only and then tested on the NIH/MIDRC dataset. The learning rate was $3\times 10^{-4}$.
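
The micro-averaged metrics named in the Figure 4 caption can be illustrated with a short sketch. It assumes each report is represented as a set of disease labels and pools true positives, false positives, and false negatives over all reports before computing the scores. The function name `micro_prf` and the toy labels are hypothetical and are not taken from the paper; this is the standard definition of micro-averaging, not necessarily the authors' exact evaluation script.

```python
from typing import Iterable, Set, Tuple


def micro_prf(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-report label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and present in the reference
        fp += len(p - g)  # labels predicted but absent from the reference
        fn += len(g - p)  # reference labels the model missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (hypothetical): three reports, one with no findings.
gold = [{"pneumonia", "effusion"}, {"cardiomegaly"}, set()]
pred = [{"pneumonia"}, {"cardiomegaly", "edema"}, set()]

precision, recall, f1 = micro_prf(gold, pred)
print(f"micro-P={precision:.3f} micro-R={recall:.3f} micro-F1={f1:.3f}")
```

Because counts are pooled before dividing, frequent diseases weigh more heavily than rare ones, which is what distinguishes micro-averaging from macro-averaging over per-disease scores.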