Table of Contents
Fetching ...

Can Rule-Based Insights Enhance LLMs for Radiology Report Classification? Introducing the RadPrompt Methodology

Panagiotis Fytas, Anna Breger, Ian Selby, Simon Baker, Shahab Shahipasand, Anna Korhonen

TL;DR

This work tackles the costly problem of labeling chest X-ray pathologies by leveraging radiology reports as distant supervision. It introduces RadPert, a rule-based system that uses RadGraph’s uncertainty tagging to robustify label extraction with a compact rule set, achieving significant gains over CheXpert on both MIMIC-CXR and CUH. To further harness language model capabilities, the authors propose RadPrompt, a two-turn prompting strategy that uses RadPert hints to refine LLM predictions, yielding statistically significant improvements over zero-shot baselines and, in some cases, over the underlying base models themselves. The approach demonstrates the value of integrating structured rule-based knowledge with powerful LLMs for biomedical NLP tasks, while also acknowledging language- and dataset-specific limitations and ethical considerations for external evaluations.

Abstract

Developing imaging models capable of detecting pathologies from chest X-rays can be cost and time-prohibitive for large datasets as it requires supervision to attain state-of-the-art performance. Instead, labels extracted from radiology reports may serve as distant supervision since these are routinely generated as part of clinical practice. Despite their widespread use, current rule-based methods for label extraction rely on extensive rule sets that are limited in their robustness to syntactic variability. To alleviate these limitations, we introduce RadPert, a rule-based system that integrates an uncertainty-aware information schema with a streamlined set of rules, enhancing performance. Additionally, we have developed RadPrompt, a multi-turn prompting strategy that leverages RadPert to bolster the zero-shot predictive capabilities of large language models, achieving a statistically significant improvement in weighted average F1 score over GPT-4 Turbo. Most notably, RadPrompt surpasses both its underlying models, showcasing the synergistic potential of LLMs with rule-based models. We have evaluated our methods on two English Corpora: the MIMIC-CXR gold-standard test set and a gold-standard dataset collected from the Cambridge University Hospitals.

Can Rule-Based Insights Enhance LLMs for Radiology Report Classification? Introducing the RadPrompt Methodology

TL;DR

This work tackles the costly problem of labeling chest X-ray pathologies by leveraging radiology reports as distant supervision. It introduces RadPert, a rule-based system that uses RadGraph’s uncertainty tagging to robustify label extraction with a compact rule set, achieving significant gains over CheXpert on both MIMIC-CXR and CUH. To further harness language model capabilities, the authors propose RadPrompt, a two-turn prompting strategy that uses RadPert hints to refine LLM predictions, yielding statistically significant improvements over zero-shot baselines and, in some cases, over the underlying base models themselves. The approach demonstrates the value of integrating structured rule-based knowledge with powerful LLMs for biomedical NLP tasks, while also acknowledging language- and dataset-specific limitations and ethical considerations for external evaluations.

Abstract

Developing imaging models capable of detecting pathologies from chest X-rays can be cost and time-prohibitive for large datasets as it requires supervision to attain state-of-the-art performance. Instead, labels extracted from radiology reports may serve as distant supervision since these are routinely generated as part of clinical practice. Despite their widespread use, current rule-based methods for label extraction rely on extensive rule sets that are limited in their robustness to syntactic variability. To alleviate these limitations, we introduce RadPert, a rule-based system that integrates an uncertainty-aware information schema with a streamlined set of rules, enhancing performance. Additionally, we have developed RadPrompt, a multi-turn prompting strategy that leverages RadPert to bolster the zero-shot predictive capabilities of large language models, achieving a statistically significant improvement in weighted average F1 score over GPT-4 Turbo. Most notably, RadPrompt surpasses both its underlying models, showcasing the synergistic potential of LLMs with rule-based models. We have evaluated our methods on two English Corpora: the MIMIC-CXR gold-standard test set and a gold-standard dataset collected from the Cambridge University Hospitals.
Paper Structure (18 sections, 4 figures, 14 tables)

This paper contains 18 sections, 4 figures, 14 tables.

Figures (4)

  • Figure 1: Overview of the RadPrompt methodology. RadPrompt utilizes the rule-based RadPert model to detect potential errors in the original (first-turn) LLM classification decision. A second-turn prompt is then constructed, offering evidence that may cause the LLM to revise its original classification outcome.
  • Figure 2: Examples of RadPert rules for Cardiomegaly. The rules take the form of graphs that follow the RadGraph RadGraphNeurips information schema. The ".*" symbolizes allowing the matching of different prefixes and suffixes within the entity span.
  • Figure 3: Normalized confusion matrices for MIMIC-CXR gold-standard test set.
  • Figure 4: Normalized confusion matrices for CUH test set.