Fine-grained Prompt Tuning: A Parameter and Memory Efficient Transfer Learning Method for High-resolution Medical Image Classification
Yijin Huang, Pujin Cheng, Roger Tam, Xiaoying Tang
TL;DR
The paper tackles the memory bottleneck of applying large pre-trained models to high-resolution medical image classification. It proposes Fine-grained Prompt Tuning (FPT), a parameter-efficient transfer learning method that freezes the large pre-trained model and trains only a lightweight side network. Three components transfer knowledge from the frozen model to the side network: asymmetric input (high resolution for the frozen model, low resolution for the side network), fine-grained prompts, and a cross-attention-based fusion module. Two further mechanisms, important token selection and preloading of intermediate features, reduce training memory while preserving performance. With a ViT-B encoder and $512\times512$ inputs, FPT uses about $1.8\%$ of the learnable parameters and $13\%$ of the memory of full fine-tuning, while maintaining competitive AUC across four medical datasets. It also shows favorable trade-offs between performance and parameter or memory cost (the PPE and PME metrics), making large pre-trained models more practical for high-resolution medical imaging tasks.
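To give a rough sense of why asymmetric input saves memory, the sketch below counts patch tokens and self-attention matrix entries for a ViT at two resolutions. The patch size of 16 and the low resolution of 128 are assumptions made for illustration; neither value is stated in this summary.

```python
def num_patch_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of patch tokens for a square image (class token not counted)."""
    per_side = image_size // patch_size
    return per_side * per_side


def attention_entries(image_size: int, patch_size: int = 16) -> int:
    """Entries in one self-attention matrix, which grows quadratically with token count."""
    n = num_patch_tokens(image_size, patch_size)
    return n * n


# Assumed resolutions: 512 for the frozen model, 128 for the side network.
high_res, low_res = 512, 128
print(num_patch_tokens(high_res))   # 1024 tokens
print(num_patch_tokens(low_res))    # 64 tokens
print(attention_entries(high_res) // attention_entries(low_res))  # 256
```

Under these assumptions the low-resolution branch handles 16 times fewer tokens and 256 times fewer attention entries per layer, which is the kind of saving the learnable side network benefits from.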
Abstract
Parameter-efficient transfer learning (PETL) is proposed as a cost-effective way to transfer pre-trained models to downstream tasks, avoiding the high cost of updating entire large-scale pre-trained models (LPMs). In this work, we present Fine-grained Prompt Tuning (FPT), a novel PETL method for medical image classification. FPT significantly reduces memory consumption compared to other PETL methods, especially in high-resolution input contexts. To achieve this, we first freeze the weights of the LPM and construct a learnable lightweight side network. The frozen LPM takes high-resolution images as input to extract fine-grained features, while the side network is fed low-resolution images to reduce memory usage. To allow the side network to access pre-trained knowledge, we introduce fine-grained prompts that summarize information from the LPM through a fusion module. Important token selection and preloading techniques are employed to further reduce training cost and memory requirements. We evaluate FPT on four medical datasets with varying sizes, modalities, and complexities. Experimental results demonstrate that FPT achieves performance comparable to fine-tuning the entire LPM while using only 1.8% of the learnable parameters and 13% of the memory cost for a ViT-B encoder at a 512 × 512 input resolution.
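The two ideas of selecting important tokens and summarizing LPM features into a few prompts can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not say how token importance is scored or how the fusion module is parameterized, so the `scores` argument, the function names, and the projection-free attention are all hypothetical simplifications.

```python
import math


def select_important_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order.
    `scores` is a hypothetical per-token importance value."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(prompts, tokens):
    """Each prompt (query) attends over the tokens (keys and values).
    Learned projections are omitted to keep the sketch short."""
    dim = len(tokens[0])
    fused = []
    for q in prompts:
        logits = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(logits)
        fused.append([sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)])
    return fused


# Toy features standing in for frozen LPM outputs (4 tokens, dimension 2).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
scores = [0.1, 0.7, 0.9, 0.2]
kept = select_important_tokens(tokens, scores, keep=2)
summary = cross_attention([[0.0, 0.0]], kept)  # one prompt summarizes the kept tokens
```

However many tokens the frozen model produces, the output has one vector per prompt, so the side network only ever processes a short, fixed-length summary.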
