TRRG: Towards Truthful Radiology Report Generation With Cross-modal Disease Clue Enhanced Large Language Model

Yuhao Wang; Chao Hao; Yawen Cui; Xinqi Su; Weicheng Xie; Tao Tan; Zitong Yu

TL;DR

TRRG tackles truthful radiology report generation by addressing imbalanced disease supervision and misalignment between radiographs and reports. It introduces a two-stage framework: (1) sentence-level contrastive pretraining to align the vision encoder with disease-focused textual cues, and (2) clue-enhanced instruct-tuning that injects disease clues via a dedicated module and cross-modal interaction, guided by a disease-aware consistency loss. The method achieves state-of-the-art results on MIMIC-CXR and strong performance on IU-Xray, with ablations confirming the contributions of the disease clue injection, cross-modal interaction, and the disease-aware loss. These advances improve both linguistic quality and clinical effectiveness, enabling more truthful and actionable radiology reports, and hold promise for extending to other medical imaging modalities.

Abstract

The vision-language modeling capability of multi-modal large language models has attracted wide attention from the community. However, in the medical domain, radiology report generation with vision-language models still faces significant challenges: the data distribution is imbalanced because radiology reports contain numerous negated descriptions, and the alignment between radiology reports and radiographs is coarse. In this paper, we propose TRRG, a truthful radiology report generation framework based on stage-wise training that injects cross-modal disease clues into large language models. In the pre-training stage, contrastive learning is employed to enhance the ability of the visual encoder to perceive fine-grained disease details. In the fine-tuning stage, our proposed clue injection module enhances the disease-oriented perception capability of the large language model by incorporating robust zero-shot disease perception. Finally, through the cross-modal clue interaction module, our model achieves multi-granular interaction between visual embeddings and an arbitrary number of disease clue embeddings. This significantly enhances the report generation capability and clinical effectiveness of multi-modal large language models in radiology report generation. Experimental results demonstrate that the proposed pre-training and fine-tuning framework achieves state-of-the-art performance in radiology report generation on datasets such as IU-Xray and MIMIC-CXR. Further analysis indicates that the proposed method improves the model's ability to perceive diseases and its clinical effectiveness.
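The contrastive pre-training idea mentioned in the abstract can be illustrated with a generic symmetric InfoNCE loss over paired image and sentence embeddings. This is a minimal sketch, not the paper's exact objective: the function name `info_nce`, the temperature value, and the use of plain NumPy arrays in place of real encoder outputs are all assumptions made for illustration.

```python
import numpy as np


def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Row i of image_emb is assumed to match row i of text_emb.
    Hypothetical helper for illustration; not taken from the paper.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = img @ txt.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The matching pair for row i sits on the diagonal.
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    print("aligned pairs:   ", info_nce(emb, emb))
    print("mismatched pairs:", info_nce(emb, emb[::-1]))
```

In this sketch, correctly paired embeddings yield a lower loss than mismatched ones, which is the signal that pulls an image representation toward the sentences describing it.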

Paper Structure (22 sections, 23 equations, 5 figures, 4 tables)

This paper contains 22 sections, 23 equations, 5 figures, 4 tables.

Introduction
Related work
Radiology Report Generation
Multi-modal Large Language Model
Vision Language Pretraining
Methods
Stage 1: Disease-aware Cross-modal Fine-Grained Alignment
Stage 2: Clue Enhanced Instruct Tuning on Radiology Report Generation
Visual Embedding
Clue Injection Module
Cross Modal Clue Interaction
Optimization Objective
Experiment
Datasets and Evaluation Metrics
Datasets
...and 7 more sections

Figures (5)

Figure 1: The training strategy of our proposed TRRG
Figure 2: During the fine-tuning stage, the visual encoder and the clue encoder are frozen, and disease clues are injected simultaneously through the clue injection module. In this stage, visual embeddings processed by the visual mapper interact with disease clue embeddings through cross-modal clue interaction. Finally, the frozen large language model is fine-tuned through instruction-based fine-tuning to achieve medical image report generation.
Figure 3: Architecture of the Clue Injection Module. HP and FL denote Hadamard Product and Flatten, respectively.
Figure 4: Architecture of Cross Modal Clue Interaction Module
Figure 5: We compare the generated results of the base model and the TRRG (Ours) with the ground truth, highlighting key information using colored fonts. Our model effectively generated specific descriptions tailored to diseases.
