Medical Knowledge Intervention Prompt Tuning for Medical Image Classification

Ye Du, Nanxi Yu, Shujun Wang

TL;DR

Medical image classification with vision-language models faces high fine-tuning costs. This paper introduces CILMP, a framework that leverages disease-specific knowledge from large language models to generate instance-adaptive prompts for vision-language models through a conditional, low-rank intervention mechanism. Across 11 datasets and multiple modalities, CILMP consistently outperforms state-of-the-art prompt-tuning methods while using only a fraction of trainable parameters, approaching the performance of full fine-tuning. The approach demonstrates the practical value of transferring medical knowledge from LLMs into prompt tuning, enabling robust, efficient, and scalable adaptation of VLMs for clinical tasks.

Abstract

Vision-language foundation models (VLMs) have shown great potential in feature transfer and generalization across a wide spectrum of medical downstream tasks. However, fine-tuning these models is resource-intensive because of their large number of parameters. Prompt tuning has emerged as a viable way to reduce memory usage and training time while maintaining competitive performance. The challenge is that existing prompt tuning methods cannot precisely distinguish between different kinds of medical concepts, so they miss the disease-specific features that matter across imaging modalities in medical image classification. We find that Large Language Models (LLMs), trained on extensive text corpora, are particularly adept at providing this specialized medical knowledge. Motivated by this, we propose incorporating LLMs into the prompt tuning process. Specifically, we introduce CILMP (Conditional Intervention of Large Language Models for Prompt Tuning), a method that bridges LLMs and VLMs to transfer medical knowledge into VLM prompts. CILMP extracts disease-specific representations from LLMs, intervenes on them within a low-rank linear subspace, and uses the result to create disease-specific prompts. In addition, a conditional mechanism conditions the intervention on each individual medical image, producing instance-adaptive prompts and thus improving adaptability. Extensive experiments across diverse medical image datasets show that CILMP consistently outperforms state-of-the-art prompt tuning methods. Code is available at https://github.com/usr922/cilmp.
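To make the idea of an image-conditioned, low-rank intervention concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a LoReFT-style update, h + Rᵀ(Wh + b + Uv − Rh), in which the edit to the LLM representation h is confined to the r-dimensional subspace spanned by the rows of R and the image feature v shifts the target inside that subspace. The function name, the matrices R, W, U, the bias b, and the way v enters the update are all illustrative assumptions; the paper's exact intervention function may differ.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def transpose(M):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


def conditional_low_rank_intervention(h, v, R, W, b, U):
    """Hypothetical image-conditioned low-rank edit of a representation.

    h: LLM representation, length d
    v: image feature, length k
    R, W: r x d matrices (r << d in practice); rows of R span the edited subspace
    b: bias, length r
    U: r x k matrix injecting the image feature (an assumption of this sketch)
    """
    Rh = matvec(R, h)   # current coordinates of h in the subspace
    Wh = matvec(W, h)   # learned target coordinates
    Uv = matvec(U, v)   # image-dependent shift of the target
    delta = [wh + bi + uv - rh for wh, bi, uv, rh in zip(Wh, b, Uv, Rh)]
    update = matvec(transpose(R), delta)  # map the r-dim edit back to d dims
    return [hi + ui for hi, ui in zip(h, update)]


# Toy example with d = 3, r = 1, k = 2: only the first coordinate can change.
h = [1.0, 2.0, 3.0]
v = [1.0, 2.0]
R = [[1.0, 0.0, 0.0]]
W = [[0.0, 0.0, 0.0]]
b = [5.0]
U = [[1.0, 1.0]]
print(conditional_low_rank_intervention(h, v, R, W, b, U))  # [8.0, 2.0, 3.0]
```

In this form only R, W, b, and U would be trained, which is where the parameter savings of a low-rank edit come from: the cost scales with r·d rather than with the size of either frozen model.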

Paper Structure

This paper contains 40 sections, 12 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Concept illustration of our CILMP method. CILMP first extracts concept-aware representations from a frozen large language model. It then intervenes in these representations with the guidance of image features to generate the adaptive disease prompts for the VLM text encoder.
  • Figure 2: Illustration of the CILMP framework. CILMP first extracts concept-aware representations $\bm{h}_y$ from an LLM. Then, a conditional intervention function is introduced to adapt these representations towards accurate disease label prediction, producing intervened representations $\bar{\bm{h}}_y$. After dimension adjustment, the resulting $\tilde{\bm{h}}_y$ are concatenated with the original prompts $\bm{p}_y$ to generate the adaptive disease prompts $\tilde{\bm{p}}_y$. Finally, $\mathcal{L}_{\text{CILMP}}$ is used to guide the prompt tuning process for the VLM.
  • Figure 3: Centered kernel alignment (CKA) heatmap (Kornblith et al., 2019) between representations from different layers of LLaMA3-8B (Llama 3 model card). The red box (last row) displays the similarity between the representation from the last layer and those from other layers, while the green box highlights the similarity between adjacent layers.
  • Figure 4: FLOPs comparison between CILMP and other competitive prompt tuning methods, conducted on the ADAM dataset.
  • Figure 5: Qualitative analysis based on t-SNE (van der Maaten and Hinton, 2008) visualization. Compared to conventional prompt tuning, CILMP generates features that are more discriminative across classes.
  • ...and 2 more figures
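
As a rough illustration of the similarity measure named in Figure 3, the sketch below computes linear centered kernel alignment between two sets of representations in plain Python. It assumes the linear variant, CKA(X, Y) = ‖YᵀX‖²_F / (‖XᵀX‖_F ‖YᵀY‖_F) on column-centered features; the caption does not say which kernel the paper uses. The function names and the toy matrices are illustrative only.

```python
import math


def center_columns(X):
    """Subtract each column's mean; X is a list of n rows (one per example)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]


def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y, summed over all column pairs."""
    total = 0.0
    for x_col in zip(*X):
        for y_col in zip(*Y):
            dot = sum(a * b for a, b in zip(x_col, y_col))
            total += dot * dot
    return total


def linear_cka(X, Y):
    """Linear CKA between two representations of the same n examples.

    X is n x d1 and Y is n x d2; the feature widths may differ.
    Undefined (division by zero) if either input is constant across examples.
    """
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_frobenius_sq(Xc, Yc)
    denominator = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return numerator / denominator


# Toy "layer outputs" for 4 examples; Y is X with its columns swapped and scaled.
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
Y = [[2.0 * row[1], 2.0 * row[0]] for row in X]
print(round(linear_cka(X, Y), 6))  # 1.0
```

A layer-by-layer heatmap like the one described would apply such a function to every pair of layers, using the same batch of inputs for both. Linear CKA is invariant to isotropic scaling and to orthogonal transformations of the features, which is why the swapped and rescaled copy above scores 1.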