A Survey of Deep Learning-based Radiology Report Generation Using Multimodal Data

Xinyi Wang; Grazziela Figueredo; Ruizhe Li; Wei Emma Zhang; Weitong Chen; Xin Chen

TL;DR

This survey analyzes deep-learning approaches to automatic radiology report generation from multi-modal data, covering work published from 2021 to 2024. It details a five-step workflow (data acquisition, data preparation, feature learning, feature fusion and interaction, and report generation) and contrasts traditional deep-learning methods with large-model approaches. It highlights trends toward Transformer-based image feature learning, knowledge integration via graphs, and memory-augmented decoding, while noting that the language-quality improvements brought by large models do not always translate into higher medical correctness. The article also reviews datasets, evaluation metrics (including RadGraph-based and LLM-based methods), explainability techniques, and benchmarks such as MIMIC-CXR and ReXrank, and discusses challenges in multi-modal data construction, standardized evaluation, and human-AI interaction. Overall, the work underscores the potential of multi-modal fusion and large models for radiology report generation while calling for unified benchmarks, improved evaluation of medical correctness, and better explainability to enable clinical deployment.

Abstract

Automatic radiology report generation can alleviate physicians' workload and reduce regional disparities in medical resources, and has therefore become an important topic in medical image analysis. It is a challenging task, as the computational model needs to mimic physicians in obtaining information from multi-modal input data (e.g., medical images, clinical information, and medical knowledge) and producing comprehensive and accurate reports. Recently, numerous works have addressed this task using deep-learning-based methods, such as transformers, contrastive learning, and knowledge-base construction. This survey summarizes the key techniques developed in the most recent works and proposes a general workflow for deep-learning-based report generation with five main components: multi-modality data acquisition, data preparation, feature learning, feature fusion and interaction, and report generation. The state-of-the-art methods for each of these components are highlighted. Additionally, we summarize the latest developments in large model-based methods and model explainability, along with public datasets, evaluation methods, current challenges, and future directions in this field. We have also conducted a quantitative comparison between different methods in the same experimental setting. This is the most up-to-date survey that focuses on multi-modality inputs and data fusion for radiology report generation. The aim is to provide comprehensive and rich information for researchers interested in automatic clinical report generation and medical image analysis, especially when using multi-modal inputs, and to assist them in developing new algorithms to advance the field.

Paper Structure (41 sections, 7 figures, 5 tables)

Introduction
Search and selection of articles
Methods
Multi-modality inputs acquisition
Data preparation
Feature learning
Image-based feature learning
Architectures for image feature extraction
Auxiliary loss functions for image feature extraction
Enhancement modules for image feature extraction
Non-imaging-based feature learning
Multi-modal feature fusion and interaction
Report generation
Decoder-based techniques for report generation
Architectures for decoder-based techniques
...and 26 more sections

Figures (7)

Figure 1: The distributions of reviewed papers using image data and multi-modality data as inputs per year from 2021 to 2024. The percentage denotes the input's prevalence among articles published within the year.
Figure 2: An overview of report generation: workflow and taxonomy of employed approaches. a. The typical workflow of automatic radiology report generation. b. The summary of the usage of techniques in the reviewed papers for each step. (x%) represents the percentage of these articles relative to the 100 papers reviewed.
Figure 3: A summary of the types, examples, and sources of multi-modal input data.
Figure 4: Three typical methods for extracting case-related information. (a) Extracting relevant textual reports from the training set by comparing the similarity between the input image and images in the training set. (b) Extracting relevant medical descriptions from a concepts library based on the image's annotated keywords. (c) Automatically extracting entities (e.g., anatomy) from the reports retrieved in (a) and using these entities to search the knowledge base for related knowledge graphs. "..." refers to concepts such as these. The figure only lists a few examples from the training dataset, keyword labels of images, medical descriptions, recognized entities in the report, and knowledge base.
Figure 5: The statistics of the reviewed papers using different architectures to extract image features per year from 2021 to 2024. The percentage denotes the method's prevalence among articles published within the year.
...and 2 more figures
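
Method (a) in the Figure 4 caption, retrieving reports from the training set by comparing the input image with training images, can be illustrated with a minimal sketch. It assumes each image has already been encoded into a fixed-length feature vector and that similarity is measured with cosine similarity; both are assumptions made for illustration, since the caption does not specify the encoder or the similarity measure. The names `cosine_similarity` and `retrieve_reports`, the `top_k` parameter, and the toy vectors and report strings are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_reports(query_feature, train_features, train_reports, top_k=2):
    """Return the reports of the top_k training images most similar to the query."""
    scored = [
        (cosine_similarity(query_feature, feature), report)
        for feature, report in zip(train_features, train_reports)
    ]
    # Highest similarity first; only the score is used as the sort key.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [report for _, report in scored[:top_k]]


# Toy data (hypothetical): three training images with 3-dimensional features.
train_features = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
train_reports = ["report A", "report B", "report C"]

query_feature = [1.0, 0.05, 0.0]
retrieved = retrieve_reports(query_feature, train_features, train_reports, top_k=2)
print(retrieved)
```

In a real system the feature vectors would come from a trained image encoder, and the retrieved reports would be passed on as additional textual input to the later fusion and generation steps.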
