Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization

Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon, Md Rakibul Hasan

Abstract

Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines in the FINDINGS $\to$ IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS (Visual-Text Attention Summarizer), a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. ViTAS achieves state-of-the-art results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores. Our findings demonstrate that less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization.

Paper Structure

This paper contains 18 sections, 11 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of the ViTAS pipeline: Chest X-rays (frontal and lateral) are lung-segmented using MedSAM2, then fused via a dual SwinV2 with cross-attention. Interpretability modules identify important regions, which are selectively tokenized and combined with text embeddings in a T5 decoder to generate the final clinical impression.
  • Figure 2: Ensemble-guided MedSAM2 lung segmentation pipeline. A reference box generates five shifted bounding boxes. MedSAM2 produces binary masks, which are unioned into the final lung ROI mask. The segmented image retains only the pulmonary fields.
  • Figure 3: Dual Swin Transformer V2 with bidirectional mid-fusion cross-attention. Frontal and lateral views are processed by separate backbones, exchange information via cross-attention, and their global features are fused before classification.
  • Figure 4: Attention-driven patch selection pipeline (frontal top row, lateral bottom row): original X-ray, SwinV2 32$\times$32 heatmap (downsampled from 64$\times$64), overlay on MedSAM2-segmented lungs, DBSCAN clusters (15 frontal, 8 lateral), projected ViT 14$\times$14 heatmap, and final selected clusters (8 frontal, 5 lateral). Only these pathology-dominant patches feed the multimodal T5 decoder.
  • Figure 5: Qualitative analysis of left lung opacification. Ground truth shows increasing opacification. Full-image and ROI models detect pneumonia with suggested follow-up, while the ViTAS ROI-patch model closely matches ground truth, capturing pneumonia and possible lymphangitic spread with suggestions.
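The patch-selection step summarized in Figure 4 — mask the attention map to the segmented lungs, cluster the high-attention patches, and keep only the top-ranked clusters — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: a simple 4-connected grouping stands in for DBSCAN, and the function name `select_pathology_patches` along with the `attn_frac` threshold and `top_k` setting are assumptions for the sketch.

```python
import numpy as np

def select_pathology_patches(attn_map, lung_mask, attn_frac=0.6, top_k=8):
    """Sketch of attention-driven patch selection (cf. Figure 4).

    attn_map:  2D array of per-patch attention scores (e.g. a 32x32 heatmap).
    lung_mask: binary mask at the same resolution (e.g. from MedSAM2).
    Connected-component grouping stands in for the paper's DBSCAN step;
    attn_frac and top_k are illustrative values, not the paper's settings.
    """
    scores = attn_map * lung_mask              # suppress patches outside the lungs
    keep = scores > attn_frac * scores.max()   # retain only high-attention patches
    H, W = keep.shape
    labels = -np.ones(keep.shape, dtype=int)
    next_label = 0
    # Group spatially adjacent high-attention patches into candidate regions
    for r0 in range(H):
        for c0 in range(W):
            if keep[r0, c0] and labels[r0, c0] < 0:
                stack = [(r0, c0)]
                labels[r0, c0] = next_label
                while stack:
                    r, c = stack.pop()
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = r + dr, c + dc
                        if 0 <= nr < H and 0 <= nc < W and keep[nr, nc] and labels[nr, nc] < 0:
                            labels[nr, nc] = next_label
                            stack.append((nr, nc))
                next_label += 1
    # Rank regions by mean attention and keep the top_k pathology-dominant ones
    means = [scores[labels == lab].mean() for lab in range(next_label)]
    top = set(np.argsort(means)[::-1][:top_k])
    return [(r, c) for r in range(H) for c in range(W) if labels[r, c] in top]
```

Only the patch coordinates returned here would then be tokenized and passed, together with the text embeddings, to the T5 decoder — the "less but more relevant" visual input the abstract argues for.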