Automated Identification of Incidentalomas Requiring Follow-Up: A Multi-Anatomy Evaluation of LLM-Based and Supervised Approaches

Namu Park, Farzad Ahmed, Zhaoyi Sun, Kevin Lybarger, Ethan Breinhorst, Julie Hu, Ozlem Uzuner, Martin Gunn, Meliha Yetisgen

TL;DR

This work tackles incidentaloma follow-up identification in radiology by combining lesion-level tagging with anatomy-grounded prompts. It benchmarks supervised transformer encoders against generative LLMs and shows that anatomy-informed prompting substantially boosts performance: the anatomy-informed GPT-OSS-20B achieves the highest single-model incidentaloma-positive macro-F1 (0.79, against an inter-annotator agreement of 0.76), and a majority-vote ensemble of the top systems raises it to 0.90. The findings indicate that structured lesion context and anatomical grounding enable reliable, interpretable automated surveillance of incidental findings in radiology workflows. The approach supports scalable decision support and follow-up tracking, while highlighting trade-offs in computation and the need for multi-institution validation and longitudinal data before broader deployment.

Abstract

Objective: To evaluate large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas requiring follow-up, addressing the limitations of current document-level classification systems.

Methods: We utilized a dataset of 400 annotated radiology reports containing 1,623 verified lesion findings. We compared three supervised transformer-based encoders (BioClinicalModernBERT, ModernBERT, Clinical Longformer) against four generative LLM configurations (Llama 3.1-8B, GPT-4o, GPT-OSS-20B). We introduced a novel inference strategy using lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. Performance was evaluated using class-specific F1-scores.

Results: The anatomy-informed GPT-OSS-20B model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79. This surpassed all supervised baselines (maximum macro-F1: 0.70) and closely matched the inter-annotator agreement of 0.76. Explicit anatomical grounding yielded statistically significant performance gains across GPT-based models (p < 0.05), while a majority-vote ensemble of the top systems further improved the macro-F1 to 0.90. Error analysis revealed that anatomy-aware LLMs demonstrated superior contextual reasoning in distinguishing actionable findings from benign lesions.

Conclusion: Generative LLMs, when enhanced with structured lesion tagging and anatomical context, significantly outperform traditional supervised encoders and achieve performance comparable to human experts. This approach offers a reliable, interpretable pathway for automated incidental finding surveillance in radiology workflows.

Paper Structure

This paper contains 25 sections, 3 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Prompt used for verifying incidentaloma status across target anatomies. Exclusion criteria and examples are derived from the annotation guidelines.
  • Figure 2: Example of input and output used for LLM-based incidentaloma identification using GPT-OSS-20B. Lesions that are not returned in the JSON output are treated as No Incidentaloma (Class 0). Reasoning traces are available only in GPT-OSS-20B inferences, as GPT-4o does not provide reasoning trace outputs.
  • Figure 3: Pairwise non-parametric bootstrap comparison of model performance on incidentaloma-positive lesions. Each point represents the mean difference in Macro-F1 (A − B) across 1,000 lesion-level bootstrap samples, with horizontal bars showing 95% confidence intervals. Comparisons to the right of zero indicate that Model A outperformed Model B.
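
The pairwise comparison described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' code. It assumes a paired resampling scheme (the same resampled lesions are scored for both models) and a percentile confidence interval, neither of which is spelled out in the caption. The function names `macro_f1` and `paired_bootstrap_diff` are hypothetical, and the incidentaloma-positive label set `(1, 2)` is an assumption; the source only states that Class 0 means No Incidentaloma.

```python
import random
from typing import Sequence, Tuple


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             labels: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over the given labels."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class absent from both gold and predictions scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap_diff(y_true: Sequence[int], pred_a: Sequence[int],
                          pred_b: Sequence[int],
                          labels: Sequence[int] = (1, 2),  # assumed positive classes
                          n_boot: int = 1000,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) of macro-F1(A) - macro-F1(B)."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        # Resample lesions with replacement; reuse the indices for both models.
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(macro_f1(t, a, labels) - macro_f1(t, b, labels))
    diffs.sort()
    mean_diff = sum(diffs) / n_boot
    # Percentile 95% interval (an assumption about how the bars were computed).
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[min(n_boot - 1, int(0.975 * n_boot))]
    return mean_diff, lower, upper


if __name__ == "__main__":
    # Toy labels only; these are not data from the paper.
    gold = [0, 1, 2] * 20
    model_a = list(gold)           # hypothetical model that matches gold
    model_b = [0] * len(gold)      # hypothetical model that never flags a lesion
    print(paired_bootstrap_diff(gold, model_a, model_b))
```

Under this reading, an interval lying entirely to the right of zero would correspond to Model A outperforming Model B, which matches how the caption says the plot should be read.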