Identifying Imaging Follow-Up in Radiology Reports: A Comparative Analysis of Traditional ML and LLM Approaches
Namu Park, Giridhar Kaushik Ramachandran, Kevin Lybarger, Fei Xia, Ozlem Uzuner, Meliha Yetisgen, Martin Gunn
TL;DR
This work addresses the challenge of identifying imaging follow-up in radiology reports by introducing a large, annotated corpus of $6{,}393$ reports from $586$ patients and benchmarking a spectrum of methods: traditional ML (LR, SVM), fine-tuned transformers (Longformer, Llama3-8B-Instruct), and generative LLMs (GPT-4o, GPT-OSS-20B). Evaluations under base and task-optimized prompts show that GPT-4o (Advanced) achieves the top F1 of $0.832$, with GPT-OSS-20B (Advanced) close behind at $0.828$; both approach inter-annotator agreement ($0.846$). Importantly, LR and SVM provide strong, resource-efficient baselines ($F1 \approx 0.775$–$0.776$), indicating that well-engineered traditional models remain valuable in clinical NLP. The results highlight the critical role of task-aware prompt design for LLMs and suggest a complementary deployment strategy in which open-source models and traditional methods supplement closed-source LLMs for robust, privacy-preserving radiology NLP applications.
Abstract
Large language models (LLMs) have shown considerable promise in clinical natural language processing, yet few domain-specific datasets exist to rigorously evaluate their performance on radiology tasks. In this work, we introduce an annotated corpus of 6,393 radiology reports from 586 patients, each labeled for follow-up imaging status, to support the development and benchmarking of follow-up adherence detection systems. Using this corpus, we systematically compared supervised classifiers, including traditional machine-learning models (logistic regression (LR) and support vector machines (SVM)) and fine-tuned transformers (Longformer and a fully fine-tuned Llama3-8B-Instruct), with recent generative LLMs. For the generative LLMs, we tested GPT-4o and the open-source GPT-OSS-20B under two configurations: a baseline (Base) setting and a task-optimized (Advanced) setting that focused inputs on metadata, recommendation sentences, and their surrounding context. A refined prompt for GPT-OSS-20B further improved reasoning accuracy. Performance was assessed using precision, recall, and F1 scores, with 95% confidence intervals estimated via non-parametric bootstrapping. Inter-annotator agreement was high (F1 = 0.846). GPT-4o (Advanced) achieved the best performance (F1 = 0.832), followed closely by GPT-OSS-20B (Advanced; F1 = 0.828). LR and SVM also performed strongly (F1 = 0.776 and 0.775, respectively), underscoring that while LLMs approach human-level agreement through prompt optimization, interpretable and resource-efficient models remain valuable baselines.
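The evaluation protocol mentioned above (precision, recall, and F1 with 95% confidence intervals from non-parametric bootstrapping) can be illustrated with a minimal sketch. It assumes binary labels, resampling at the report level, and a percentile interval; the abstract does not specify these details, and the function names `prf1` and `bootstrap_f1_ci` are hypothetical.

```python
import random


def prf1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, positive class = 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1.

    Assumption: examples are resampled with replacement at the report level.
    A patient-level resampling scheme would need grouped indices instead.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Draw n indices with replacement and score the resampled set.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_true = [y_true[i] for i in idx]
        sample_pred = [y_pred[i] for i in idx]
        scores.append(prf1(sample_true, sample_pred)[2])
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Calling `bootstrap_f1_ci(gold_labels, predicted_labels)` returns a `(lower, upper)` pair for the chosen `alpha`; the fixed `seed` only makes the illustration reproducible and is not taken from the paper.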
