Multi-modality Regional Alignment Network for Covid X-Ray Survival Prediction and Report Generation

Zhusi Zhong; Jie Li; John Sollee; Scott Collins; Harrison Bai; Paul Zhang; Terrence Healey; Michael Atalay; Xinbo Gao; Zhicheng Jiao

TL;DR

This work tackles automated radiology report generation and survival prediction for COVID-19 chest X-rays (CXRs) by grounding textual descriptions in high-risk anatomical regions. It introduces MRANet, a framework that combines region detection (Faster R-CNN with a Region Completer), multi-scale region-feature encoding, survival-guided sentence embedding, image-to-text LLM alignment (GatorTron and GPT-2), and a two-stage multi-modal survival predictor. The approach yields region-grounded sentences and prognostic signals, and is validated across multiple centers on the Brown-COVID and Penn-COVID datasets, with improvements in C-index and in clinical evaluation metrics. The study contributes to interpretability and trust in AI-assisted radiology by linking visual regions, descriptive text, and survival risk, and suggests directions for further improving clinical transparency.
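The C-index mentioned above is the standard concordance index for survival models: among all comparable patient pairs, the fraction in which the patient with the shorter observed survival time received the higher predicted risk. The following is a minimal sketch of Harrell's formulation, not the evaluation code used by the authors; the function name `concordance_index` and the toy inputs are illustrative assumptions.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk.

    times  -- observed follow-up time per subject
    events -- 1 if the event was observed, 0 if the subject was censored
    risks  -- predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that subject `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Skip tied times and pairs whose earlier subject was censored.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (made-up values): the third subject is censored.
score = concordance_index([5, 10, 15, 20], [1, 1, 0, 1], [0.9, 0.6, 0.7, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a higher C-index indicates better agreement between predicted risk and observed survival order.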

Abstract

In response to the worldwide COVID-19 pandemic, advanced automated technologies have emerged as valuable tools to aid healthcare professionals in managing an increased workload by improving radiology report generation and prognostic analysis. This study proposes the Multi-modality Regional Alignment Network (MRANet), an explainable model for radiology report generation and survival prediction that focuses on high-risk regions. By learning spatial correlation in the detector, MRANet visually grounds region-specific descriptions, providing robust anatomical regions with a completion strategy. The visual features of each region are embedded using a novel survival attention mechanism, offering spatially and risk-aware features for sentence encoding while maintaining global coherence across tasks. A cross-LLM alignment is employed to enhance the image-to-text transfer process, resulting in sentences rich in clinical detail and improved explainability for radiologists. Multi-center experiments validate both MRANet's overall performance and each module's contribution within the model, encouraging further advances in radiology report generation research that emphasize clinical interpretation and trustworthiness in AI models applied to medical studies. The code is available at https://github.com/zzs95/MRANet.

Paper Structure (28 sections, 14 equations, 6 figures, 4 tables)

Introduction
Related work
Survival analysis
Radiology report generation
METHODS
Framework Overview
Anatomical region detection and completion
Multi-scale Region-feature Encoding
Survival-guided Sentence-feature Encoding
Sentence Generation
Image-to-text LLMs-Alignment
Multi-modality Survival Prediction
EXPERIMENT
Datasets
Training and Inference
...and 13 more sections

Figures (6)

Figure 1: Overview of the proposed method. Anatomical region detection serves as the foundation for our approach. Focusing on the detected lung region groups (shown in orange), imaging features are selected and aggregated for COVID-19 report generation and survival analysis. The generated sentence and risk score are explicitly grounded within anatomical region groups, providing mutual benefits through risk description and sentence-wise survival consistency.
Figure 2: Overview of the proposed Multi-modality Regional Alignment Network (MRANet), comprising region-based report generation with survival attention. Our framework extracts reports and clinical variables from medical data, along with the corresponding images, for each modality; the red, green, and yellow branches represent these data flows. The Multi-scale Region-feature Encoder (MRE) aggregates local features in predicted anatomical regions to create a comprehensive feature representation for each sentence. The text decoder and encoder are LLMs used to constrain the image-to-text feature space alignment for report generation, whose input is embedded by the Survival-guided Sentence-feature Encoder (SSE). The multi-modality survival prediction module produces the risk prediction and provides survival attention to the sentence-feature learning in anatomical regions.
Figure 3: Illustration of the proposed Multi-scale Region-feature Encoder (MRE) and Survival-guided Sentence-feature Encoder (SSE). The MRE extracts regional features from the multi-scale ResNet backbone and aggregates them by concatenating the grouped features for each report sentence. The SSE passes the last visual features through a survival attention module to learn global risks. It uses local and global embedders to encode the visual attributes and risk attention contained in the sentences.
Figure 4: Illustration of the anatomical region detector with Faster R-CNN and the proposed Region Completer, which corrects the bounding boxes of undetected regions using the learned spatial coordinate pattern.
Figure 5: An example chest X-ray image with detected anatomical regions and the corresponding structured report. The four Findings sentences (Lungs, Pleura, Heart and mediastinum, and Bones) summarize the regions in the four groups illustrated in the four panels on the left. The Impression sentence describes the whole image.
...and 1 more figures
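
The Figure 3 caption describes the MRE as extracting regional features from a multi-scale backbone and concatenating the grouped features for each report sentence. The sketch below illustrates that idea only. It assumes plain average pooling inside each box, which may differ from the pooling operator actually used in MRANet, and the names `pool_region` and `encode_sentence_group`, the array shapes, and the box format `(x0, y0, x1, y1)` in image pixels are all hypothetical.

```python
import numpy as np


def pool_region(feature_map, box, image_size):
    """Average-pool a (C, H, W) feature map inside one box given in image pixels.

    Average pooling is an assumption made for this sketch.
    """
    _, h, w = feature_map.shape
    x0, y0, x1, y1 = box
    # Rescale pixel coordinates to this feature map's resolution.
    sx, sy = w / image_size[0], h / image_size[1]
    c0 = min(int(np.floor(x0 * sx)), w - 1)
    r0 = min(int(np.floor(y0 * sy)), h - 1)
    # Guarantee at least one cell so the mean is always defined.
    c1 = max(int(np.ceil(x1 * sx)), c0 + 1)
    r1 = max(int(np.ceil(y1 * sy)), r0 + 1)
    return feature_map[:, r0:r1, c0:c1].mean(axis=(1, 2))


def encode_sentence_group(feature_maps, boxes, image_size):
    """Concatenate pooled features over all scales and all boxes of one group."""
    per_region = [
        np.concatenate([pool_region(fm, box, image_size) for fm in feature_maps])
        for box in boxes
    ]
    return np.concatenate(per_region)


# Toy usage: two scales (4 and 8 channels) and two boxes in a 64x64 image.
maps = [np.ones((4, 8, 8)), np.ones((8, 4, 4))]
group_boxes = [(0, 0, 32, 32), (16, 16, 64, 64)]
sentence_feature = encode_sentence_group(maps, group_boxes, (64, 64))
```

Under these assumptions, each sentence group yields one vector whose length is the number of boxes times the total channel count across scales (here 2 x 12 = 24).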
