Disease-informed Adaptation of Vision-Language Models
Jiajin Zhang, Ge Wang, Mannudeep K. Kalra, Pingkun Yan
TL;DR
This work tackles the challenge of adapting pretrained vision-language models to underrepresented or novel diseases in medical imaging under severe data scarcity. It introduces a disease-informed adaptation framework comprising DiCoP and DPL: DiCoP builds disease-specific prompts grounded in clinical attributes (texture, location, shape) and enriches them with image context via an image feature projector, while DPL learns disease prototypes with geometric regularization to shape a well-structured latent space. The model is trained with three losses, $L_{ita}$, $L_{prot}$, and $L_{reg-ce}$, to align image and prompt representations and to enforce prototype intra-class cohesion and inter-class separation, with only the vision branch used for inference. Empirical results on PanNuke and COVID-x show substantial gains over adapter- and prompting-based baselines, especially with limited labeled data, and ablations validate the contribution of each component. Overall, the approach offers a practical, data-efficient pathway to deploy clinically informed VLMs across diverse medical imaging tasks and diseases.
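As a rough illustration of what "intra-class cohesion and inter-class separation" could mean for prototypes, the sketch below pairs a cross-entropy over feature-to-prototype cosine similarities with a penalty on pairwise prototype similarity. This is not the paper's definition of $L_{prot}$ or $L_{reg-ce}$, which this summary does not spell out. The function names, the temperature value, and the toy vectors are all assumptions made for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def prototype_ce_loss(features, labels, prototypes, temperature=0.1):
    """Hypothetical cohesion term: cross-entropy over feature-to-prototype similarities."""
    total = 0.0
    for f, y in zip(features, labels):
        logits = [cosine(f, p) / temperature for p in prototypes]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[y]
    return total / len(features)

def separation_penalty(prototypes):
    """Hypothetical separation term: mean cosine similarity over distinct prototype pairs."""
    k = len(prototypes)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return sum(cosine(prototypes[i], prototypes[j]) for i, j in pairs) / len(pairs)

# Toy 2-D image features for two classes (assumed values, for illustration only).
features = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0]]
labels = [0, 0, 1]
separated = [[1.0, 0.0], [0.0, 1.0]]   # prototypes aligned with their classes
collapsed = [[1.0, 0.0], [0.9, 0.1]]   # prototypes nearly on top of each other
```

With these toy inputs, the well-separated prototypes yield both a lower cross-entropy and a lower separation penalty than the nearly collapsed ones, which is the behavior a geometric regularizer of this kind is meant to reward.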
Abstract
In medical image analysis, the scarcity of expertise and the high cost of data annotation limit the development of large artificial intelligence models. This paper investigates the potential of transfer learning with pre-trained vision-language models (VLMs) in this domain. Currently, VLMs still struggle to transfer to underrepresented diseases with minimal presence in the pretraining dataset and to new diseases entirely absent from it. We argue that effective adaptation of VLMs hinges on nuanced representation learning of disease concepts. By capitalizing on the joint visual-linguistic capabilities of VLMs, we introduce disease-informed contextual prompting in a novel disease prototype learning framework. This approach enables VLMs to grasp the concepts of new diseases effectively and efficiently, even with limited data. Extensive experiments across multiple image modalities show notable performance improvements over existing techniques.
