Exploring Low-Resource Medical Image Classification with Weakly Supervised Prompt Learning

Fudan Zheng, Jindong Cao, Weijiang Yu, Zhiguang Chen, Nong Xiao, Yutong Lu

TL;DR

This work tackles the high annotation burden in medical image classification by introducing MedPrompt, a two-stage framework that first performs unsupervised vision-language pre-training on large-scale medical image-text data and then learns an instance-adaptive, weakly supervised prompt generator. The prompt generator, built from a Meta-Net together with context and class embeddings, automatically produces prompts that align with image embeddings, enabling zero-shot and few-shot classification without expert-designed prompts. Across four chest X-ray datasets, MedPrompt reaches superior or comparable zero-shot accuracy relative to hand-crafted prompts and, with only a few labeled samples, outperforms full-shot counterparts that use hand-crafted prompts. The prompt module is lightweight and can be embedded in various architectures. The approach reduces reliance on domain experts and supports medical image recognition in low-resource settings.

Abstract

Most advances in medical image recognition for auxiliary clinical diagnosis are constrained by the low-resource nature of the medical field, where annotations are expensive and require professional expertise. This low-resource problem can be alleviated by leveraging the transferable representations of large-scale pre-trained vision-language models via relevant medical text prompts. However, existing pre-trained vision-language models require domain experts to carefully design the medical prompts, which greatly increases the burden on clinicians. To address this problem, we propose MedPrompt, a weakly supervised prompt learning method that automatically generates medical prompts. It comprises an unsupervised pre-trained vision-language model and a weakly supervised prompt learning model. The unsupervised pre-trained vision-language model exploits the natural correlation between medical images and their corresponding medical texts for pre-training, without any manual annotations. The weakly supervised prompt learning model uses only the class labels of images in the dataset to guide the learning of the class-specific vector in the prompt, while the other context vectors in the prompt are learned without any manual annotations. To the best of our knowledge, this is the first model to automatically generate medical prompts. With these prompts, the pre-trained vision-language model is freed from its strong dependence on experts for manual annotation and manual prompt design. Experimental results show that, with only a minimal number of labeled samples for few-shot learning, the model using our automatically generated prompts outperforms its counterparts that use hand-crafted prompts with full-shot learning, and that it reaches superior or comparable accuracy on zero-shot image classification. The proposed prompt generator is lightweight and can therefore be embedded into any network architecture.
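
To make the idea of an instance-adaptive prompt concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a CoCoOp-style design in which a small Meta-Net maps the image embedding to a shift that is added to shared context vectors, and a per-class vector is appended to form each prompt. The sizes, the random weights, the mean-pooling `encode_text` stand-in for the text encoder, and the temperature value are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, N_CTX, N_CLASSES, HIDDEN = 16, 4, 3, 8  # illustrative sizes only

# Learnable pieces of the hypothetical prompt generator (randomly initialised here).
context = rng.normal(size=(N_CTX, EMBED_DIM))        # shared context vectors
class_emb = rng.normal(size=(N_CLASSES, EMBED_DIM))  # one class vector per category
w1 = rng.normal(size=(EMBED_DIM, HIDDEN)) * 0.1      # Meta-Net layer 1 (assumed MLP)
w2 = rng.normal(size=(HIDDEN, EMBED_DIM)) * 0.1      # Meta-Net layer 2 (assumed MLP)


def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def meta_net(image_emb):
    """Map one image embedding to a shift shared by all context vectors."""
    return np.tanh(image_emb @ w1) @ w2


def generate_prompts(image_emb):
    """Build one prompt per class: shifted context vectors plus the class vector.

    Returns an array of shape (N_CLASSES, N_CTX + 1, EMBED_DIM).
    """
    shifted = context + meta_net(image_emb)  # depends on the image
    ctx = np.broadcast_to(shifted, (N_CLASSES, N_CTX, EMBED_DIM))
    return np.concatenate([ctx, class_emb[:, None, :]], axis=1)


def encode_text(prompt_tokens):
    """Stand-in for a real text encoder: mean-pool the token vectors."""
    return prompt_tokens.mean(axis=1)


def classify(image_emb, temperature=0.07):
    """Softmax over cosine similarities between the image and each class prompt."""
    text_emb = l2_normalize(encode_text(generate_prompts(image_emb)))
    logits = text_emb @ l2_normalize(image_emb) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


image_emb = rng.normal(size=EMBED_DIM)  # placeholder for an image-encoder output
print(classify(image_emb))              # one probability per class
```

In this sketch only `class_emb` encodes category information, which mirrors the abstract's statement that class labels guide the class-specific vector while the context vectors need no manual annotation.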

Paper Structure

This paper contains 26 sections, 10 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Examples of prompts manually designed by clinicians for different categories in different datasets. (a) For atelectasis in the CheXpert Dataset. (b) For COVID in the COVID Dataset. It can be seen that these text prompts are highly related to the characteristics of images in the dataset and strongly dependent on domain experts and domain knowledge.
  • Figure 2: The overall architecture of our proposed model, MedPrompt. The model mainly comprises a pre-training stage (dark gray line) and a prompt learning stage (orange line). In the pre-training stage, the model learns transferable representations by matching the similarity between paired images and reports. In the prompt learning stage, the model trains an instance-adaptive prompt generator with the help of the image embeddings and finally learns automatic prompt embeddings for zero-shot image classification. When performing few-shot image classification, the model fine-tunes the class embeddings of the prompt generator using very few images from the unseen categories, with the rest of the prompt generator remaining fixed. The green line represents zero-shot or few-shot inference on downstream image classification with the pre-trained model. Best viewed in color.
  • Figure 3: Performance comparison of our model on four datasets when using either the traditional CNN-based architecture ResNet-50 (orange lines) or the Transformer-based architecture Swin Transformer (blue lines) as the image encoder. The models with ResNet-50 as the image encoder are overall inferior to their Swin Transformer counterparts. Best viewed in color.
  • Figure 4: Visualization of prompt similarities on four datasets. (a) The similarities of prompts automatically generated by the model. (b) The similarities of manually designed prompts. Both the ten images from which the model generates its prompts and the ten manual prompts are randomly selected. Best viewed in color.
  • Figure 5: Class activation maps for images of each category in the four datasets. It can be observed from these cases that our model's image encoder can efficiently learn the most noteworthy regions in an X-ray image.
  • ...and 2 more figures
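
The Figure 2 caption describes pre-training as matching the similarity between paired images and reports. One common way to realise that kind of matching is a CLIP-style symmetric contrastive loss; the sketch below illustrates that generic technique under this assumption and is not confirmed by the text above to be the exact objective used by MedPrompt. The function names and the temperature value are hypothetical.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, report_emb, temperature=0.07):
    """Average of image-to-report and report-to-image cross-entropy.

    Row i of image_emb and row i of report_emb are assumed to be a true pair;
    every other row in the batch acts as a negative.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    report_emb = report_emb / np.linalg.norm(report_emb, axis=1, keepdims=True)
    logits = image_emb @ report_emb.T / temperature  # (batch, batch) similarities
    diag = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)


rng = np.random.default_rng(0)
images = rng.normal(size=(8, 32))                        # placeholder image embeddings
reports = images + 0.01 * rng.normal(size=images.shape)  # placeholder paired reports
print(symmetric_contrastive_loss(images, reports))
```

The loss is small when each image is most similar to its own report and grows when pairs are mismatched, which is why no manual labels are needed: the pairing itself is the supervision signal.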