
TRRG: Towards Truthful Radiology Report Generation With Cross-modal Disease Clue Enhanced Large Language Model

Yuhao Wang, Chao Hao, Yawen Cui, Xinqi Su, Weicheng Xie, Tao Tan, Zitong Yu

TL;DR

TRRG tackles truthful radiology report generation by addressing imbalanced disease supervision and misalignment between radiographs and reports. It introduces a two-stage framework: (1) sentence-level contrastive pretraining to align the vision encoder with disease-focused textual cues, and (2) clue-enhanced instruct-tuning that injects disease clues via a dedicated module and cross-modal interaction, guided by a disease-aware consistency loss. The method achieves state-of-the-art results on MIMIC-CXR and strong performance on IU-Xray, with ablations confirming the contributions of the disease clue injection, cross-modal interaction, and the disease-aware loss. These advances improve both linguistic quality and clinical effectiveness, enabling more truthful and actionable radiology reports, and hold promise for extending to other medical imaging modalities.
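The contrastive pretraining idea in stage (1) can be illustrated with a generic symmetric InfoNCE-style loss over paired image and sentence embeddings. This is a minimal sketch under stated assumptions, not the authors' objective: the function names, the one-to-one pairing of image `k` with sentence `k`, and the temperature value of 0.07 are all illustrative choices that do not come from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, sentence_embs, temperature=0.07):
    """Symmetric InfoNCE loss; image k is assumed to match sentence k.

    Hypothetical helper for illustration only. The temperature is an
    assumed default, not a value reported in the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in sentence_embs]
    n = len(imgs)
    # Cosine similarities scaled by the temperature.
    sim = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry.
    t2i = sum(
        cross_entropy_row([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (i2t + t2i)


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Matched pairs yield a lower loss than mismatched ones, which is the signal that pulls the vision encoder toward disease-focused sentence representations. A sentence-level setup, where one image may pair with several sentences, would need a different positive-pair definition than the diagonal assumed here.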

Abstract

The vision-language modeling capability of multi-modal large language models has attracted wide attention from the community. However, in the medical domain, radiology report generation using vision-language models still faces significant challenges: the data distribution is imbalanced because radiology reports contain numerous negated descriptions, and the alignment between radiology reports and radiographs is coarse. In this paper, we propose a truthful radiology report generation framework, namely TRRG, based on stage-wise training for cross-modal disease clue injection into large language models. In the pre-training stage, contrastive learning is employed to enhance the ability of the visual encoder to perceive fine-grained disease details. In the fine-tuning stage, our proposed clue injection module significantly enhances the disease-oriented perception capability of the large language model by effectively incorporating robust zero-shot disease perception. Finally, through the cross-modal clue interaction module, our model effectively achieves multi-granular interaction between visual embeddings and an arbitrary number of disease clue embeddings. This significantly enhances the report generation capability and clinical effectiveness of multi-modal large language models in the field of radiology report generation. Experimental results demonstrate that our proposed pre-training and fine-tuning framework achieves state-of-the-art performance in radiology report generation on datasets such as IU-Xray and MIMIC-CXR. Further analysis indicates that our proposed method effectively enhances the model's ability to perceive diseases and improves its clinical effectiveness.
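The interaction between visual embeddings and an arbitrary number of disease clue embeddings can be pictured with plain cross-attention, where each visual token queries the set of clues. The following is a rough sketch under stated assumptions, not the paper's module: `cross_attend` is a hypothetical name, and the single attention head, the absence of learned projections, and the residual addition are simplifications chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(visual_tokens, clue_embeddings):
    """Fuse each visual token with an attention-weighted mix of clues.

    Hypothetical illustration: visual tokens act as queries, clue
    embeddings act as keys and values. Works for any number of clues,
    including zero (the tokens are then returned unchanged).
    """
    if not clue_embeddings:
        return [list(tok) for tok in visual_tokens]
    dim = len(clue_embeddings[0])
    scale = math.sqrt(dim)
    fused = []
    for query in visual_tokens:
        # Scaled dot-product score between this token and every clue.
        scores = [
            sum(q * k for q, k in zip(query, clue)) / scale
            for clue in clue_embeddings
        ]
        weights = softmax(scores)
        # Weighted average of the clue embeddings.
        context = [
            sum(w * clue[d] for w, clue in zip(weights, clue_embeddings))
            for d in range(dim)
        ]
        # Residual connection keeps the original visual information.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


tokens = [[1.0, 0.0], [0.0, 1.0]]
one_clue = cross_attend(tokens, [[0.5, 0.5]])
three_clues = cross_attend(tokens, [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])
print(len(one_clue), len(three_clues))
```

Because the softmax normalizes over however many clues are present, the output keeps the shape of the visual tokens whether one clue or many are supplied, which is the property the abstract highlights.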

Paper Structure (22 sections, 23 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: The training strategy of our proposed TRRG
  • Figure 2: During the fine-tuning stage, the visual encoder and the clue encoder are frozen, and disease clues are injected simultaneously through the clue injection module. In this stage, visual embeddings processed by the visual mapper interact with disease clue embeddings through cross-modal clue interaction. Finally, the frozen large language model is fine-tuned through instruction-based fine-tuning to achieve medical image report generation.
  • Figure 3: Architecture of the Clue Injection Module. HP and FL denote the Hadamard product and the flatten operation, respectively.
  • Figure 4: Architecture of the Cross-Modal Clue Interaction Module.
  • Figure 5: We compare the reports generated by the base model and by TRRG (Ours) with the ground truth, highlighting key information in colored fonts. Our model generates specific, disease-tailored descriptions.