Knowledge-grounded Adaptation Strategy for Vision-language Models: Building Unique Case-set for Screening Mammograms for Residents Training

Aisha Urooj Khan; John Garrett; Tyler Bradshaw; Lonie Salkowski; Jiwoong Jason Jeong; Amara Tariq; Imon Banerjee

TL;DR

The paper addresses domain adaptation of vision-language models to mammography. It introduces a knowledge-grounded selective sampling framework built on BIRADS-based concept extraction, concept-based grouping, and balanced negative mining, applied during pretraining and few-shot fine-tuning of ALBEF and MedCLIP. It shows that carefully constructed minibatches, which balance rare and frequent concept groups while ensuring true negatives, improve image-to-text and text-to-image retrieval on internal and external mammography datasets, with Recall@$K$ gains that are most notable for ALBEF. Ablation studies examine the effect of batch size, group distribution, and recalibration of frequent groups on performance, which informs practical training on imbalanced medical data. Overall, the approach offers a scalable pathway for domain-specific adaptation of multimodal models in radiology and could support resident training through targeted case retrieval. Cross-domain generalization still leaves room for improvement for some architectures, such as MedCLIP.
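The idea of a minibatch that balances rare and frequent concept groups while keeping in-batch negatives true can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' sampler: it assumes each image-report pair already carries a single concept-group label, and it enforces "true negatives" by allowing at most one sample per group in a batch. The names `build_group_index` and `sample_minibatch` are hypothetical.

```python
import random
from collections import defaultdict


def build_group_index(group_ids):
    """Map each concept group to the indices of the samples in that group."""
    index = defaultdict(list)
    for i, group in enumerate(group_ids):
        index[group].append(i)
    return dict(index)


def sample_minibatch(index, batch_size, rng):
    """Draw one minibatch with at most one sample per concept group.

    Assumption (not from the paper): groups are drawn uniformly, so a rare
    group is as likely to appear in a batch as a frequent one, and any two
    samples in the batch come from different groups, which makes every
    in-batch negative a true negative at the group level.
    """
    groups = list(index)
    if batch_size > len(groups):
        raise ValueError("batch_size exceeds the number of distinct groups")
    chosen_groups = rng.sample(groups, batch_size)  # distinct groups
    return [rng.choice(index[g]) for g in chosen_groups]


# Toy, made-up group labels: two frequent groups and two rare ones.
group_ids = ["F"] * 50 + ["S"] * 30 + ["H+calc"] * 3 + ["E+asymm"] * 2
index = build_group_index(group_ids)
batch = sample_minibatch(index, batch_size=3, rng=random.Random(0))
```

Drawing groups uniformly rather than drawing samples uniformly is one simple way to keep frequent groups from dominating; the recalibration of frequent groups mentioned in the TL;DR may work differently.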

Abstract

Vision-language models (VLMs) pre-trained on natural image-text pairs face a significant barrier when applied to medical contexts because of domain shift. Adapting or fine-tuning these VLMs for medical use also presents considerable hurdles, including domain misalignment, limited access to large datasets, and high class imbalance. There is therefore a pressing need for strategies that effectively adapt these VLMs to the medical domain, as such adaptations would be valuable in healthcare applications. In this study, we propose a framework for tailoring VLMs to the medical domain that employs selective sampling and hard-negative mining to improve performance on retrieval tasks. We validate the proposed approach on two distinct VLMs: an in-domain VLM (MedCLIP) and an out-of-domain VLM (ALBEF). We assess these models both in their original off-the-shelf state and after training with our proposed strategies, using two large datasets of mammograms and their corresponding reports. Our evaluation spans zero-shot, few-shot, and supervised scenarios. With our approach, we observe a notable improvement in Recall@K for the image-text retrieval task.
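For readers unfamiliar with the metric, Recall@K for retrieval can be computed roughly as in the sketch below. It assumes a precomputed similarity matrix between queries and candidates and exactly one ground-truth match per query; the paper's own definition (for example, whether any candidate from the same concept group counts as a hit) is not stated in this summary, so treat this as the generic form. The function name `recall_at_k` is illustrative.

```python
import numpy as np


def recall_at_k(similarity, gt_index, k):
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    similarity: (n_queries, n_candidates) array, higher means more similar.
    gt_index:   length-n_queries array with the index of the matching candidate.
    """
    similarity = np.asarray(similarity, dtype=float)
    gt_index = np.asarray(gt_index)
    # Candidate indices sorted by descending similarity, truncated to k.
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = (topk == gt_index[:, None]).any(axis=1)
    return float(hits.mean())


# Toy image-to-text scores: rows are images, columns are reports.
sim = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.2],
])
gt = [0, 1, 2]
r1_i2t = recall_at_k(sim, gt, k=1)    # image-to-text
r1_t2i = recall_at_k(sim.T, gt, k=1)  # text-to-image uses the transpose
```

Transposing the matrix swaps the roles of query and candidate, which is why one similarity matrix serves both retrieval directions.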

Paper Structure (25 sections, 7 figures, 5 tables)


Introduction
Methodology
1) Knowledge extraction:
2) Knowledge-grounded grouping:
3) Selective Sampling:
VLM Training
Experiments and Results
Datasets:
Implementation Details:
Ablations and Analyses:
Discussion and Conclusion
Related Work
Evaluation Metrics
More details about datasets:
Additional Implementation Details
...and 10 more sections

Figures (7)

Figure 1: Multimodal learning for screening mammograms: (a) a case-review session with a radiology resident; (b) framework generating a joint embedding space for bilateral mammograms and free-text radiology reports. The illustration of the joint embedding space (right) is adapted from CrossCLR (Zolfaghari et al., 2021).
Figure 1: Group distribution for the internal (institute X) and external (institute Y) test sets. For both test sets, the top 3 groups belong to breast composition. Breast tissue composition can be scattered fibroglandular (S), heterogeneous (H), fatty (F), or extremely dense (E). Short forms are used for asymmetry (asymm) and calcifications (calc).
Figure 2: Workflow for adapting the VLM with the proposed selective sampling to learn a joint representation that is aware of fine-grained knowledge. The pretrained model is tested on out-of-domain data for zero-shot evaluation. For few-shot learning, a support set is drawn from the training data to fine-tune the model.
Figure 2: Loss curves for the image-text alignment loss in ALBEF. Left: vanilla ALBEF trained on the internal dataset. Right: ALBEF after using the proposed selective sampling.
Figure 3: Qualitative results for the retrieval model. An example with words highlighted in green was marked relevant by the radiologist for the case build. Concepts highlighted in pink indicate findings in the image-report pair that are related but not exact matches.
...and 2 more figures
