MedBLIP: Fine-tuning BLIP for Medical Image Captioning
Manshi Limbu, Diwita Banerjee
TL;DR
MedBLIP investigates fine-tuning BLIP for radiology image captioning by adapting it to the ROCO dataset and comparing it against zero-shot BLIP, BLIP-2, BLIP-2 Instruct, Gemini 1.5 Flash, and ViT-GPT2. The study uses attention visualization to assess grounding and conducts an ablation comparing encoder-only, decoder-only, and full fine-tuning to understand efficiency–accuracy trade-offs. The results show that domain-specific fine-tuning improves lexical and semantic alignment on standard metrics, with decoder-only tuning reducing training time and full fine-tuning yielding the best performance when resources permit. However, qualitative and clinical evaluations reveal that improvements on automated metrics do not guarantee medical accuracy. This highlights the need for medically grounded evaluation and safeguards in deployment, and points to future work such as LoRA adaptations and knowledge-infused, safety-aware architectures.
Abstract
Medical image captioning is a challenging task that requires generating clinically accurate and semantically meaningful descriptions of radiology images. While recent vision-language models (VLMs) such as BLIP, BLIP-2, Gemini, and ViT-GPT2 show strong performance on natural image datasets, they often produce generic or imprecise captions when applied to specialized medical domains. In this project, we explore the effectiveness of fine-tuning the BLIP model on the ROCO dataset for improved radiology captioning. We compare the fine-tuned BLIP against its zero-shot version, BLIP-2 base, BLIP-2 Instruct, and a ViT-GPT2 transformer baseline. Our results demonstrate that domain-specific fine-tuning of BLIP significantly improves performance in both quantitative metrics and qualitative evaluation. We also visualize decoder cross-attention maps to assess interpretability and conduct an ablation study to evaluate the contributions of encoder-only and decoder-only fine-tuning. Our findings highlight the importance of targeted adaptation for medical applications and suggest that decoder-only fine-tuning (with the encoder frozen) offers a strong performance baseline with 5% lower training time than full fine-tuning, while full model fine-tuning still yields the best results overall.
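
The three ablation settings (encoder-only, decoder-only, full) amount to choosing which parameters receive gradients. The following is a minimal sketch of that selection, not the authors' training code. It assumes parameter names carry the prefixes `vision_model.` and `text_decoder.` (an assumption modeled on the Hugging Face BLIP captioning model's layout) and works on any object exposing a PyTorch-style `named_parameters()`. `set_trainable`, `ToyParam`, and `ToyModel` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass

# Assumed submodule prefixes; adjust to the actual model's parameter names.
ENCODER_PREFIX = "vision_model."
DECODER_PREFIX = "text_decoder."
MODES = {"encoder_only", "decoder_only", "full"}


def set_trainable(model, mode):
    """Set requires_grad per ablation mode; return the trainable names."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    trainable = []
    for name, param in model.named_parameters():
        is_encoder = name.startswith(ENCODER_PREFIX)
        is_decoder = name.startswith(DECODER_PREFIX)
        # "full" trains everything; the other modes train one side only
        # and freeze every remaining parameter.
        param.requires_grad = (
            mode == "full"
            or (mode == "encoder_only" and is_encoder)
            or (mode == "decoder_only" and is_decoder)
        )
        if param.requires_grad:
            trainable.append(name)
    return trainable


@dataclass
class ToyParam:
    """Stand-in for a tensor parameter; only the flag matters here."""
    requires_grad: bool = True


class ToyModel:
    """Stand-in model with one encoder and one decoder parameter."""

    def __init__(self):
        self._params = {
            "vision_model.patch_embedding.weight": ToyParam(),
            "text_decoder.output_projection.weight": ToyParam(),
        }

    def named_parameters(self):
        return list(self._params.items())


if __name__ == "__main__":
    model = ToyModel()
    for mode in ("encoder_only", "decoder_only", "full"):
        print(mode, set_trainable(model, mode))
```

With a real PyTorch model, the same function could be called once before building the optimizer, passing the optimizer only the parameters whose `requires_grad` is true, so that frozen weights are skipped during the backward pass and update step.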
