Knowledge-grounded Adaptation Strategy for Vision-language Models: Building Unique Case-set for Screening Mammograms for Residents Training
Aisha Urooj Khan, John Garrett, Tyler Bradshaw, Lonie Salkowski, Jiwoong Jason Jeong, Amara Tariq, Imon Banerjee
TL;DR
The paper addresses domain adaptation of vision-language models to mammography by introducing a knowledge-grounded selective sampling framework built on BIRADS-based concept extraction, concept-based grouping, and balanced negative mining during pretraining and few-shot fine-tuning of ALBEF and MedCLIP. It demonstrates that carefully constructed minibatches, which balance rare and frequent concept groups while ensuring true negatives, improve image-to-text and text-to-image retrieval across internal and external mammography datasets, with Recall@$K$ gains that are most notable for ALBEF. Ablation studies reveal how batch size, group distribution, and recalibration of frequent groups affect performance, which guides practical training on imbalanced medical data. Overall, the approach provides a scalable pathway for domain-specific adaptation of multimodal models in radiology, with implications for enhancing resident training through targeted case retrieval, while leaving room for improvement in cross-domain generalization for certain architectures such as MedCLIP.
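The minibatch construction summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each training sample already carries a single concept-group label, and it makes two simplifying choices that the summary does not confirm, namely that groups are drawn uniformly (so rare groups appear as often as frequent ones) and that each batch holds at most one sample per group (so every in-batch negative comes from a different group). The function name `sample_balanced_batch` and the example group labels are hypothetical.

```python
import random
from collections import defaultdict


def sample_balanced_batch(group_of, batch_size, rng):
    """Draw one minibatch with at most one sample per concept group.

    group_of   -- dict mapping sample id -> concept-group label (assumed given)
    batch_size -- number of samples to draw
    rng        -- random.Random instance, passed in for reproducibility
    """
    # Index the samples by their concept group.
    members = defaultdict(list)
    for sample_id, group in group_of.items():
        members[group].append(sample_id)

    if batch_size > len(members):
        raise ValueError("batch_size exceeds the number of concept groups")

    # Pick distinct groups uniformly, ignoring group frequency, so that rare
    # groups are not drowned out by frequent ones (an assumption of this sketch).
    chosen_groups = rng.sample(sorted(members), batch_size)

    # One sample per chosen group: any two samples in the batch then belong to
    # different groups, so each in-batch negative pair differs in concept group.
    return [rng.choice(members[g]) for g in chosen_groups]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical, heavily imbalanced toy data: 100 / 2 / 1 samples per group.
    group_of = {f"a{i}": "mass" for i in range(100)}
    group_of.update({"b0": "calcification", "b1": "calcification", "c0": "asymmetry"})
    print(sample_balanced_batch(group_of, 3, rng))
```

Sampling groups uniformly is only one way to balance rare and frequent groups; the recalibration of frequent groups mentioned above suggests the actual scheme is more nuanced.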
Abstract
Vision-language models (VLMs) pre-trained on natural image-text pairs face a significant barrier when applied to medical contexts because of domain shift. Adapting or fine-tuning these VLMs for medical use presents considerable hurdles of its own, including domain misalignment, limited access to large datasets, and severe class imbalance. There is therefore a pressing need for strategies that effectively adapt these VLMs to the medical domain, since such adaptations would be highly valuable in healthcare applications. In this study, we propose a framework for tailoring VLMs to the medical domain that employs selective sampling and hard-negative mining to improve performance on retrieval tasks. We validate the proposed approach on two distinct VLMs: an in-domain VLM (MedCLIP) and an out-of-domain VLM (ALBEF). We assess these models both in their original off-the-shelf state and after applying our proposed training strategies, using two large datasets of mammograms and their corresponding reports. Our evaluation spans zero-shot, few-shot, and supervised scenarios. With our approach, we observe a notable improvement in Recall@K for the image-text retrieval task.
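For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@K is commonly computed for retrieval. It assumes a query-by-candidate similarity matrix and a boolean relevance matrix are already available, and it counts a query as a hit when at least one relevant candidate appears among its top K. The exact definition used in the paper's evaluation may differ, and the function name `recall_at_k` is illustrative.

```python
import numpy as np


def recall_at_k(similarity, relevant, k):
    """Fraction of queries with at least one relevant candidate in the top k.

    similarity -- (n_queries, n_candidates) array of matching scores
    relevant   -- boolean array of the same shape, True for correct matches
    k          -- number of top-ranked candidates to inspect per query
    """
    similarity = np.asarray(similarity, dtype=float)
    relevant = np.asarray(relevant, dtype=bool)

    # Indices of the k highest-scoring candidates for each query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]

    # A query is a hit if any of its top-k candidates is marked relevant.
    hits = np.take_along_axis(relevant, top_k, axis=1).any(axis=1)
    return float(hits.mean())


if __name__ == "__main__":
    # Toy image-to-text example: 3 images, 3 reports, report i matches image i.
    sim = [[0.9, 0.1, 0.3],
           [0.2, 0.8, 0.7],
           [0.6, 0.5, 0.4]]
    rel = np.eye(3, dtype=bool)
    for k in (1, 2, 3):
        print(k, recall_at_k(sim, rel, k))
```

Text-to-image retrieval can be scored with the same function by passing the transposed similarity and relevance matrices.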
