Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning
Zubia Naz, Farhan Asghar, Muhammad Ishfaq Hussain, Yahya Hadadi, Muhammad Aasim Rafique, Wookjin Choi, Moongu Jeon
TL;DR
The paper addresses automated medical image captioning by adding a regional attention mechanism to a Swin Transformer encoder and pairing it with a BART-based decoder enhanced with PubMedBERT embeddings, with the goal of producing clinically relevant captions. On the ROCO dataset it reports clear gains in semantic fidelity, most notably in ROUGE and BERTScore, while remaining interpretable through region-level heatmaps. Evaluation covers multi-seed stability, ablations, and per-modality analysis, and the system is positioned for safe research use with a human in the loop. The work combines multi-scale visual encoding, targeted regional emphasis, and biomedical language priors; planned future work extends the attention mechanisms and broadens the evaluation.
Abstract
Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean $\pm$ std over three seeds and include $95\%$ confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size $=4$), length penalty $=1.1$, \texttt{no\_repeat\_ngram\_size} $=3$, and max length $=128$. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
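The decoding settings above can be illustrated with a minimal sketch. The dictionary keys follow the keyword names accepted by Hugging Face `generate()`; mapping "beam size" to `num_beams` is an assumption, as is the use of that library at all. The helpers `beam_score` and `banned_next_tokens` are hypothetical names introduced only to show what a length penalty and a no-repeat n-gram constraint commonly do in beam search; they are not taken from the paper's code.

```python
# Decoding settings quoted in the abstract. Key names follow Hugging Face
# `generate()` keyword arguments (an assumption about the tooling used).
DECODING_KWARGS = {
    "num_beams": 4,             # beam size
    "length_penalty": 1.1,
    "no_repeat_ngram_size": 3,
    "max_length": 128,
}
# Hypothetical usage: model.generate(pixel_values, **DECODING_KWARGS)


def beam_score(sum_logprobs, length, length_penalty=1.1):
    """Length-normalized hypothesis score (a common form, assumed here).

    Log-probabilities are negative, so a penalty above 1.0 shrinks the
    magnitude more for longer sequences and therefore favors them.
    """
    return sum_logprobs / (length ** length_penalty)


def banned_next_tokens(prefix, n=3):
    """Tokens that would complete an n-gram already present in `prefix`."""
    if n <= 0 or len(prefix) < n - 1:
        return set()
    # The last n-1 generated tokens form the context of the next n-gram.
    context = tuple(prefix[len(prefix) - (n - 1):])
    banned = set()
    for i in range(len(prefix) - n + 1):
        if tuple(prefix[i:i + n - 1]) == context:
            banned.add(prefix[i + n - 1])
    return banned
```

With `n=3`, a prefix ending in a token pair that has already appeared blocks whichever token followed that pair earlier, which is how repeated trigrams are suppressed during generation.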
