Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning

Zubia Naz, Farhan Asghar, Muhammad Ishfaq Hussain, Yahya Hadadi, Muhammad Aasim Rafique, Wookjin Choi, Moongu Jeon

TL;DR

The paper addresses automated medical image captioning by introducing a regional attention mechanism on a Swin Transformer encoder paired with a BART-based decoder enhanced with PubMedBERT embeddings to produce clinically relevant captions. It demonstrates strong semantic fidelity improvements on the ROCO dataset, especially in ROUGE and BERTScore, while maintaining interpretability through region-level heatmaps. The approach is evaluated with multi-seed stability and ablations, showing robust performance across modalities and providing a human-in-the-loop framework for safe research use. The work advances clinical captioning by combining multi-scale visual encoding, targeted regional emphasis, and biomedical language priors, with plans to extend attention mechanisms and broaden evaluation in future work.

Abstract

Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean $\pm$ std over three seeds and include $95\%$ confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size $=4$), length penalty $=1.1$, $\text{no\_repeat\_ngram\_size}=3$, and max length $=128$. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
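The reporting convention in the abstract (mean and standard deviation over three seeds, plus a 95% confidence interval) can be illustrated with a minimal sketch. The abstract does not say how the intervals were computed, so the Student's t interval used here is an assumption, the function name `summarize_seeds` is hypothetical, and the per-seed scores are placeholders rather than results from the paper.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, indexed by degrees of freedom
# (standard table values). Using a t interval is an assumption of this sketch.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def summarize_seeds(scores):
    """Return (mean, sample std, 95% CI half-width) for per-seed metric scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two seeds to estimate a spread")
    if n - 1 not in T_CRIT_95:
        raise ValueError("extend T_CRIT_95 for this number of seeds")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    half_width = T_CRIT_95[n - 1] * std / math.sqrt(n)
    return mean, std, half_width


# Placeholder scores for illustration only; these are NOT numbers from the paper.
example_scores = [0.60, 0.61, 0.59]
mean, std, half_width = summarize_seeds(example_scores)
print(f"{mean:.3f} ± {std:.3f} "
      f"(95% CI: {mean - half_width:.3f} to {mean + half_width:.3f})")
```

With only three seeds the t critical value is large (4.303 for two degrees of freedom), so an interval computed this way is considerably wider than the standard deviation alone would suggest.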

Paper Structure

This paper contains 12 sections, 6 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: The proposed medical image captioning network architecture
  • Figure 2: Chest CT scan used as the first case study, with its generated captions compared against the ground truth
  • Figure 3: Cervical spine CT scan used as the second case study, with its generated captions compared against the ground truth
  • Figure 4: Abdominal aorta CT scan used as the third case study, with its generated captions compared against the ground truth