LLM-RG4: Flexible and Factual Radiology Report Generation across Diverse Input Contexts

Zhuhao Wang, Yihua Sun, Zihan Li, Xuan Yang, Fang Chen, Hongen Liao

TL;DR

This work addresses the mismatch between input context and report generation in radiology by introducing MIMIC-RG4, a four-scenario data paradigm that mirrors real-world clinical drafting. It presents LLM-RG4, an architecture that combines a modality encoder, Adaptive Token Fusion (ATF) to maintain fixed input length across diverse inputs, and a Token-Level Loss Weighting (TLW) strategy to prioritize positive and uncertain diagnoses, thereby reducing input-agnostic hallucinations. The approach achieves state-of-the-art clinical efficacy and natural language generation while substantially limiting hallucinations on MIMIC-RG4 and MIMIC-CXR, validated through ablations and case studies. This framework promises practical impact by enabling flexible, faithful radiology report generation aligned with clinicians’ information needs and input availability.

Abstract

Drafting radiology reports is a complex task requiring flexibility, where radiologists tailor content to the available information and particular clinical demands. However, most current radiology report generation (RRG) models are constrained to a fixed task paradigm, such as predicting the full "finding" section from a single image, which inherently involves a mismatch between inputs and outputs. The trained models lack the flexibility to handle diverse inputs and can generate harmful, input-agnostic hallucinations. To bridge the gap between current RRG models and clinical demands in practice, we first develop a data generation pipeline to create a new MIMIC-RG4 dataset, which considers four common radiology report drafting scenarios and has perfectly corresponding inputs and outputs. Second, we propose a novel large language model (LLM) based RRG framework, namely LLM-RG4, which utilizes the LLM's flexible instruction-following capabilities and extensive general knowledge. We further develop an adaptive token fusion module that offers the flexibility to handle diverse scenarios with different input combinations, while minimizing the additional computational burden associated with increased input volumes. In addition, we propose a token-level loss weighting strategy to direct the model's attention towards positive and uncertain descriptions. Experimental results demonstrate that LLM-RG4 achieves state-of-the-art performance in both clinical efficacy and natural language generation on the MIMIC-RG4 and MIMIC-CXR datasets. We quantitatively demonstrate that our model has minimal input-agnostic hallucinations, whereas current open-source models commonly suffer from this problem.

Paper Structure

This paper contains 32 sections, 6 equations, 7 figures, 16 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) Mismatch between image and report in a typical RRG model. Comparisons, procedures, communication and views cannot be inferred from the image. (b) A flexible and factual RRG paradigm, which emphasizes the flexibility of input and the alignment between input and output.
  • Figure 2: The pipeline employs an iterative approach that integrates a BERT-based discriminator and an LLM-based generator, ensuring that little input-agnostic information remains and little effective information is lost.
  • Figure 3: The LLM-RG4 architecture consists of a modality encoder, an adaptive token fusion module, and a token-level loss weighting strategy. The modality encoder extracts features from various modalities. The adaptive token fusion module combines different feature tokens into a fixed length, minimizing computational burden. The token-level loss weighting strategy identifies key diagnoses and adjusts token loss weights, enhancing the model's clinical efficacy across diverse input scenarios.
  • Figure 4: An illustration of a challenging case featuring five positive or uncertain diagnoses across four different settings, where the gold standard is also presented for reference. Diagnoses shared by the gold standard and model outputs are highlighted in the same color. LLM-RG4 identifies nearly all diagnoses, whereas without TLW two diagnoses are missed.
  • Figure 5: The influence of $\lambda$ on CE metrics at stage 1.
  • ...and 2 more figures
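
The token-level loss weighting idea named in the Figure 3 caption can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not give the weighting formula, so the form used here (weight `1 + lam` on key tokens, normalized by the sum of weights) is an assumption, as are the function name `weighted_token_nll` and the `key_mask` input. The role of `lam` only loosely mirrors the $\lambda$ mentioned in the Figure 5 caption.

```python
import math


def weighted_token_nll(log_probs, key_mask, lam=1.0):
    """Weighted mean negative log-likelihood over the target tokens.

    log_probs: log-probability the model assigned to each gold token.
    key_mask:  True where a token belongs to a positive or uncertain
               diagnosis (how such tokens are found is outside this sketch).
    lam:       extra weight on key tokens; the 1 + lam form is an assumption.
    """
    if not log_probs:
        raise ValueError("need at least one token")
    if len(log_probs) != len(key_mask):
        raise ValueError("log_probs and key_mask must have the same length")
    # Ordinary tokens keep weight 1; key-diagnosis tokens are up-weighted.
    weights = [1.0 + lam if is_key else 1.0 for is_key in key_mask]
    total = sum(w * -lp for w, lp in zip(weights, log_probs))
    # Normalizing by the weight sum keeps the loss scale comparable across lam.
    return total / sum(weights)


# Hypothetical three-token target whose middle token names a key diagnosis.
log_probs = [math.log(0.9), math.log(0.5), math.log(0.8)]
key_mask = [False, True, False]
plain = weighted_token_nll(log_probs, key_mask, lam=0.0)
weighted = weighted_token_nll(log_probs, key_mask, lam=2.0)
```

With `lam=0.0` the function reduces to the ordinary mean token loss. With a positive `lam`, a poorly predicted key-diagnosis token contributes more to the total, which is the intended pressure towards positive and uncertain findings.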