MoVL: Exploring Fusion Strategies for the Domain-Adaptive Application of Pretrained Models in Medical Imaging Tasks

Haijiang Tian, Jingkun Yue, Xiaohong Liu, Guoxing Yang, Zeyu Jiang, Guangyu Wang

TL;DR

This work addresses the adaptation of natural-pretrained vision models to medical imaging, where data are scarce and domain mismatch is pronounced. It proposes MoVL, a hybrid framework that jointly trains Visual Prompting (VP) and Linear Probe (LP) while keeping the backbone frozen, and introduces a joint loss that combines a categorization term with a discrepancy term reflecting original-image information. Empirical results across four medical datasets show that MoVL achieves competitive in-distribution (ID) performance and outperforms full finetuning on an out-of-distribution (OOD) dataset, highlighting the method’s robustness to distribution shifts. The findings suggest that jointly optimizing input and output alignment with a carefully designed loss and initialization strategy can effectively adapt large pretrained models to medical imaging tasks with limited data and minimal architectural changes.

Abstract

Medical images are often more difficult to acquire than natural images because of the specialized equipment and technology involved, which leads to fewer medical image datasets. It is therefore hard to train a strong pretrained medical vision model, and how to make the best of a natural-pretrained vision model and adapt it to the medical domain remains an open question. For image classification, a popular method is the linear probe (LP). However, LP only considers the output after feature extraction, while a gap remains between the input medical images and the natural-pretrained vision model. We introduce visual prompting (VP) to fill this gap and analyze strategies for coupling LP and VP. We design a joint learning loss function containing a categorisation loss and a discrepancy loss, which describes the variance between prompted and plain images, and name this joint training strategy MoVL (Mixture of Visual Prompting and Linear Probe). We experiment on 4 medical image classification datasets with two mainstream architectures, ResNet and CLIP. Results show that, without changing the parameters or architecture of the backbone model and with fewer parameters, MoVL has the potential to reach full finetuning (FF) accuracy (averaged over four medical datasets, 90.91% for MoVL and 91.13% for FF). On an out-of-distribution medical dataset, our method (90.33%) outperforms FF (85.15%) by an absolute 5.18%.
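
To make the joint objective concrete, the following is a minimal numerical sketch of a loss that adds a categorisation term to a discrepancy term for a single sample. The abstract does not give the exact form of the discrepancy loss, so the KL divergence used here, the `weight` parameter, and all function names are assumptions made for illustration, not the authors' definition. The prediction without the prompt is treated as a fixed target, in line with the Figure 1 caption stating that $P_{-VP}$ is detached from the computational graph.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Categorisation loss for one sample with an integer class label."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q); an assumed stand-in for the paper's discrepancy loss."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def joint_loss(logits_with_vp, logits_without_vp, label, weight=0.5):
    """Categorisation loss on the prompted image plus a weighted discrepancy.

    `weight` is a hypothetical balancing coefficient, not a value from the paper.
    """
    p_plus = softmax(logits_with_vp)      # P_{+VP}: prediction on the prompted image
    p_minus = softmax(logits_without_vp)  # P_{-VP}: prediction on the plain image (constant target)
    return cross_entropy(p_plus, label) + weight * kl_divergence(p_minus, p_plus)

# Identical predictions give zero discrepancy, so only the categorisation term remains.
same = joint_loss([2.0, 0.0], [2.0, 0.0], label=0)
# Diverging predictions add a positive discrepancy penalty.
apart = joint_loss([2.0, 0.0], [0.0, 2.0], label=0)
```

In an autograd framework, the same structure would apply with gradients blocked on the plain-image branch, so that only VP and LP receive updates.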

Paper Structure

This paper contains 18 sections, 5 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of MoVL. VP and LP are trainable, and the pretrained model is frozen. $P_{-VP}$ and $P_{+VP}$ both enter the loss, and backpropagation updates LP and VP. $P_{-VP}$ is detached from the computational graph. Red and blue lines show the forward pass; purple lines show the direction of backpropagation.
  • Figure 2: Limitations of LP and VP. (a) shows the gap between input medical images and the natural-pretrained model; (b) shows the gap between output labels and the specific ground-truth labels.
  • Figure 3: Three different training strategies. Left: first train LP, then train VP. Middle: first train LP, then train LP and VP together. Right: train LP and VP together for all training epochs.
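
The three training strategies in the Figure 3 caption can be read as schedules that decide which module is updated at each epoch. The sketch below encodes that reading. The strategy names, the `switch_epoch` parameter, and the function itself are hypothetical, and the caption does not say at which epoch the switch happens.

```python
def trainable_modules(strategy, epoch, switch_epoch):
    """Return the set of modules updated at a given 0-indexed epoch.

    strategy: one of "lp_then_vp", "lp_then_joint", "joint" (hypothetical names)
    switch_epoch: first epoch of the second phase (an assumed hyperparameter)
    """
    if strategy == "lp_then_vp":
        # Left panel: train LP first, then train VP.
        return {"LP"} if epoch < switch_epoch else {"VP"}
    if strategy == "lp_then_joint":
        # Middle panel: train LP first, then train LP and VP together.
        return {"LP"} if epoch < switch_epoch else {"LP", "VP"}
    if strategy == "joint":
        # Right panel: train LP and VP together for all epochs.
        return {"LP", "VP"}
    raise ValueError(f"unknown strategy: {strategy}")

# Example: print the schedule of each strategy over 4 epochs, switching at epoch 2.
for name in ("lp_then_vp", "lp_then_joint", "joint"):
    schedule = [sorted(trainable_modules(name, e, switch_epoch=2)) for e in range(4)]
    print(name, schedule)
```

In a training loop, the returned set would determine which parameters have gradients enabled for that epoch, while the backbone stays frozen throughout.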