Can Rule-Based Insights Enhance LLMs for Radiology Report Classification? Introducing the RadPrompt Methodology
Panagiotis Fytas, Anna Breger, Ian Selby, Simon Baker, Shahab Shahipasand, Anna Korhonen
TL;DR
This work tackles the costly problem of labeling chest X-ray pathologies by using radiology reports as distant supervision. It introduces RadPert, a rule-based system that uses RadGraph’s uncertainty tagging to make label extraction more robust with a compact rule set, achieving significant gains over CheXpert on both MIMIC-CXR and the Cambridge University Hospitals (CUH) dataset. To further harness language model capabilities, the authors propose RadPrompt, a two-turn prompting strategy that uses RadPert hints to refine LLM predictions, yielding statistically significant improvements over zero-shot baselines and, in some cases, over both of its underlying models. The approach demonstrates the value of combining structured rule-based knowledge with LLMs for biomedical NLP tasks, while acknowledging language- and dataset-specific limitations and ethical considerations for external evaluations.
Abstract
Developing imaging models capable of detecting pathologies from chest X-rays can be cost- and time-prohibitive for large datasets, as it requires supervision to attain state-of-the-art performance. Instead, labels extracted from radiology reports may serve as distant supervision, since these reports are routinely generated as part of clinical practice. Despite their widespread use, current rule-based methods for label extraction rely on extensive rule sets that are limited in their robustness to syntactic variability. To alleviate these limitations, we introduce RadPert, a rule-based system that integrates an uncertainty-aware information schema with a streamlined set of rules, enhancing performance. Additionally, we have developed RadPrompt, a multi-turn prompting strategy that leverages RadPert to bolster the zero-shot predictive capabilities of large language models, achieving a statistically significant improvement in weighted average F1 score over GPT-4 Turbo. Most notably, RadPrompt surpasses both its underlying models, showcasing the synergistic potential of LLMs with rule-based models. We have evaluated our methods on two English corpora: the MIMIC-CXR gold-standard test set and a gold-standard dataset collected from Cambridge University Hospitals.
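The headline metric above is a weighted average F1 score. As a minimal sketch, assuming the common support-weighted definition (each label's F1 weighted by its number of positive gold examples, as in scikit-learn's average="weighted"), it can be computed as follows. The exact averaging used in the paper is not specified in this section, and the label names and counts below are purely illustrative, not results from the paper.

```python
from typing import Dict, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for one label from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_average_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Support-weighted mean of per-label F1.

    `counts` maps each label to (tp, fp, fn); a label's support is tp + fn,
    i.e. the number of gold-positive examples for that label.
    """
    total_support = sum(tp + fn for tp, _, fn in counts.values())
    if total_support == 0:
        return 0.0
    weighted_sum = sum(
        (tp + fn) * f1_score(tp, fp, fn) for tp, fp, fn in counts.values()
    )
    return weighted_sum / total_support


# Hypothetical counts for two pathology labels (illustration only).
example_counts = {
    "Atelectasis": (8, 2, 2),  # F1 = 0.8, support = 10
    "Edema": (3, 1, 3),        # F1 = 0.6, support = 6
}

if __name__ == "__main__":
    print(weighted_average_f1(example_counts))
```

Under this definition, frequent pathologies influence the aggregate score more than rare ones, which is worth keeping in mind when comparing systems on imbalanced label sets.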
