Intensive Vision-guided Network for Radiology Report Generation

Fudan Zheng; Mengfei Li; Ying Wang; Weijiang Yu; Ruixuan Wang; Zhiguang Chen; Nong Xiao; Yutong Lu

TL;DR

This work tackles automatic radiology report generation by addressing two core gaps: limited multi-view visual reasoning and lack of adaptive multi-modal guidance in text generation. It introduces the Intensive Vision-guided Network (IVGN), pairing a Globally-intensive Attention (GIA) visual encoder with a Visual Knowledge-guided Decoder (VKGD). The GIA module fuses depth-view, space-view, and pixel-view cues, while VKGD uses attention to integrate previously generated text with region-specific image features during word prediction, enabling more clinically accurate reports. Evaluations on IU X-Ray and MIMIC-CXR show IVGN achieving state-of-the-art or competitive performance across NLG metrics and notably higher clinical efficacy (CE) scores, with fewer parameters and lower FLOPs, indicating practical potential for deployment. The study also provides thorough ablations and qualitative analyses, confirming the value of multi-view visual reasoning and adaptive visual-grounded decoding for robust radiology report generation.

Abstract

Automatic radiology report generation is attracting growing interest because of its large application potential in healthcare. However, existing computer vision and natural language processing approaches to this problem are limited in two respects. First, when extracting image features, most of them neglect multi-view visual reasoning and model only a single-view structure of medical images, such as the space view or the channel view, whereas clinicians rely on multi-view imaging information for comprehensive judgment in daily diagnosis. Second, when generating reports, they overlook context reasoning with multi-modal information and focus on purely textual optimization using retrieval-based methods. We address these two issues with a model that better simulates clinicians' perspectives and generates more accurate reports. For feature extraction, we propose a Globally-intensive Attention (GIA) module in the medical image encoder to simulate and integrate multi-view vision perception. GIA learns three types of vision perception: depth view, space view, and pixel view. For report generation, we explore how to involve multi-modal signals to generate precisely matched reports, i.e., how to integrate previously predicted words with region-aware visual content in next-word prediction. Specifically, we design a Visual Knowledge-guided Decoder (VKGD), which adaptively weighs how much the model relies on visual information and on previously predicted text when predicting the next word. Our final Intensive Vision-guided Network (IVGN) framework thus comprises a GIA-guided Visual Encoder and the VKGD. Experiments on two commonly used datasets, IU X-Ray and MIMIC-CXR, demonstrate the superior ability of our method compared with other state-of-the-art approaches.
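To make the idea of adaptively balancing visual and textual evidence concrete, the following is a minimal sketch, not the authors' VKGD. It assumes a plain scaled dot-product attention in which the region features and a single vector summarizing the previously predicted words compete in one softmax; the function names (`softmax`, `visual_guided_context`) and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def visual_guided_context(query, region_feats, text_state):
    """Blend region features and a text-state vector with one softmax.

    Assumption: the decoder query, every region feature, and the text
    state share the same dimensionality. The last attention weight is
    the share assigned to previously predicted text; the remaining
    weights are the shares assigned to individual image regions.
    """
    candidates = list(region_feats) + [text_state]
    scale = math.sqrt(len(query))
    scores = [
        sum(q * c for q, c in zip(query, cand)) / scale
        for cand in candidates
    ]
    weights = softmax(scores)
    # Weighted sum of all candidates gives the blended context vector.
    context = [
        sum(w * cand[i] for w, cand in zip(weights, candidates))
        for i in range(len(query))
    ]
    return context, weights[:-1], weights[-1]


if __name__ == "__main__":
    ctx, visual_w, text_w = visual_guided_context(
        query=[1.0, 0.0],
        region_feats=[[0.0, 1.0], [0.0, 2.0]],
        text_state=[3.0, 0.0],
    )
    print("context:", ctx)
    print("visual weights:", visual_w, "text weight:", text_w)
```

Because all candidates share one softmax, a query that aligns with the text state shifts weight away from the image regions, and vice versa. In the architecture described here, the resulting context would then be passed, together with the text embedding, to the LSTM that predicts the next word.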

Paper Structure (28 sections, 12 equations, 10 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Image captioning
Automatic radiology report generation
Method
Overall architecture
Globally-intensive Attention-guided Visual Encoder
Globally-intensive Attention-based visual extractor
Transformer
Visual Knowledge-guided Decoder
Experiments
Datasets
Implementation details
Baselines
Evaluation metrics
...and 13 more sections

Figures (10)

Figure 1: An example of chest X-ray image and its report. The task of automatic radiology report generation is to automatically generate a report based on the given radiology image.
Figure 2: A schematic diagram of attention mechanism from different views. The left and middle subgraphs represent attention modeling only from the space view and channel view, respectively. The model can only learn the importance of different positions in the same space or the importance of different channels. The subgraph on the right represents multi-view attention modeling, in which the model can learn the importance of each position in the feature maps and ultimately guide the model to learn the most salient visual features. Best viewed in color.
Figure 3: Examples of ground-truth reports and templates. It can be seen that when writing radiology reports, clinicians often associate the symptoms they find with diseases, and clearly describe “what diseases are derived from what symptoms”, as shown in the blue sentences and red words. However, templates are often limited to specific locations and descriptions and fail to reveal a causal relationship between the abnormalities found and the diseases. Best viewed in color.
Figure 4: The overall architecture of the proposed Intensive Vision-guided Network (IVGN). One or several radiology images are first visually encoded by the Globally-intensive Attention-guided Visual Encoder, which consists of a visual extractor based on a Globally-intensive Attention (GIA) module and several Transformer encoder layers. Then, the extracted image features are fed into the Visual Knowledge-guided Decoder (VKGD) for final report generation. In this decoder, the importance of image features for different regions and the importance of the previously predicted words are learned through an attention mechanism, resulting in a visual-guided context. The visual-guided context, along with the text embeddings, is then sent into a classical LSTM that focuses on mining associations between the previously predicted words and the current image. In the training phase, text embeddings come from the ground-truth report corresponding to the input images, while in the inference phase, text embeddings refer to embeddings of the previously generated word. The decoder outputs words one by one and finally forms a complete report. Best viewed in color.
Figure 5: The upper part shows the structure of the GIA module. The features extracted from the last convolution layer of $conv3\_x$ or $conv4\_x$ of ResNet-101 are first passed through a depth-view Batch Normalization-guided Weight Adapter (BNWA) submodule and then a space-view BNWA submodule, which estimate the importance of the depth weights and space weights, respectively. After being weighted by the learned importance, the new features are passed through a pixel-view SimAM submodule, which learns the importance of each pixel across the whole feature maps. In this way, the module acquires multi-view attention. Finally, we use a residual structure to alleviate the problem of vanishing gradients C39. The bottom half of the figure shows schematic graphs of the depth-view BNWA, space-view BNWA, and pixel-view SimAM, respectively. The implementation details of these three submodules are described in Section 3.2.1. Best viewed in color.
...and 5 more figures
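The pixel-view step named in the Figure 5 caption can be illustrated with a small sketch. It follows the commonly published parameter-free SimAM formulation applied to a single feature map, followed by a residual addition; it is a sketch under stated assumptions, not the paper's implementation. The depth-view and space-view BNWA submodules are deliberately omitted because their details are not given in this text, and the function names (`simam_weights`, `simam_with_residual`) and the regularizer value `lam` are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simam_weights(fmap, lam=1e-4):
    """Per-pixel attention weights for one 2D feature map (one channel).

    Pixels that deviate more from the channel mean receive weights
    closer to 1. `lam` is a small regularizer (assumed value).
    The map must contain at least two pixels.
    """
    values = [v for row in fmap for v in row]
    n = len(values) - 1
    mean = sum(values) / len(values)
    sq_dev = [[(v - mean) ** 2 for v in row] for row in fmap]
    variance = sum(s for row in sq_dev for s in row) / n
    return [
        [sigmoid(s / (4.0 * (variance + lam)) + 0.5) for s in row]
        for row in sq_dev
    ]


def simam_with_residual(fmap, lam=1e-4):
    """Reweight each pixel by its attention, then add the input back."""
    weights = simam_weights(fmap, lam)
    return [
        [v + v * w for v, w in zip(row, wrow)]
        for row, wrow in zip(fmap, weights)
    ]


if __name__ == "__main__":
    feature_map = [
        [1.0, 1.0, 1.0],
        [1.0, 5.0, 1.0],
        [1.0, 1.0, 1.0],
    ]
    print(simam_weights(feature_map))
    print(simam_with_residual(feature_map))
```

In the toy map, the central pixel stands out from its neighbors and therefore receives the largest weight. The residual addition keeps the original signal in the output, which matches the caption's stated purpose of easing gradient flow.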
