Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning

Yogesh Thakku Suresh, Vishwajeet Shivaji Hogale, Luca-Alexandru Zamfira, Anandavardhana Hegde

TL;DR

This work addresses automated MRI captioning by learning vision–language alignment with a transformer-based multimodal framework that combines DEiT-Small for image encoding, MediCareBERT for domain-specific text embeddings, and a lightweight LSTM decoder. Training is guided by a hybrid cosine–MSE loss and contrastive semantics, and restricting the data to the target domain (Brain-Only rather than All-MRI) improves caption fidelity. Experiments on MultiCaRe show competitive performance and statistically significant gains over baselines, and ablations highlight the importance of the DEiT–BERT fusion and the hybrid loss. The approach offers a scalable, interpretable path toward clinical deployment of automated radiology reporting for MRI.

Abstract

We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as the image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and text embeddings, using a hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs with performance on general MRI images, and comparing against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
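
To make the alignment objective concrete, the following is a minimal sketch of what a hybrid cosine-MSE loss and similarity-based caption selection could look like. It is not the authors' implementation: the weighting parameter alpha, the function names, and the use of plain NumPy arrays in place of DEiT-Small and MediCareBERT outputs are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a_norm = np.linalg.norm(a, axis=1)
    b_norm = np.linalg.norm(b, axis=1)
    # eps guards against division by zero for all-zero embeddings.
    return np.sum(a * b, axis=1) / np.maximum(a_norm * b_norm, eps)


def hybrid_cosine_mse_loss(image_emb, text_emb, alpha=0.5):
    """Weighted sum of a cosine-distance term and an MSE term.

    alpha is a hypothetical mixing weight; the abstract does not state
    how the two terms are balanced.
    """
    cos_term = np.mean(1.0 - cosine_similarity(image_emb, text_emb))
    mse_term = np.mean((image_emb - text_emb) ** 2)
    return alpha * cos_term + (1.0 - alpha) * mse_term


def retrieve_caption(query_emb, caption_embs, captions):
    """Pick the caption whose embedding is most cosine-similar to the query."""
    query = np.repeat(query_emb[None, :], len(caption_embs), axis=0)
    scores = cosine_similarity(query, caption_embs)
    return captions[int(np.argmax(scores))]
```

The cosine term penalizes differences in direction between the image and text embeddings, while the MSE term also penalizes differences in magnitude, which is one plausible reason to combine the two. At inference time, retrieve_caption stands in for the "contrastive inference via vector similarity" step by ranking candidate caption embeddings against a single image embedding.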

Paper Structure

This paper contains 20 sections, 3 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Training Pipeline