Text-guided Foundation Model Adaptation for Long-Tailed Medical Image Classification
Sirui Li, Li Lin, Yijin Huang, Pujin Cheng, Xiaoying Tang
TL;DR
This work tackles long-tailed medical image classification by adapting foundation-model representations through a two-stage, text-guided framework. It introduces a residual-connection adapter for visual features and a two-adapter setup trained on re-balanced data, followed by a linear ensembler that fuses the adapted representations at the feature or logit level. Classification is guided by text prompts, using cosine similarity between visual features and semantic text features. The approach achieves state-of-the-art or competitive results on two medical datasets (ISIC2018 and APTOS2019) while using only about 6.1% of the GPU memory of the current best-performing method and relying solely on lightweight components. These results demonstrate the practical potential of text-guided foundation-model adaptation for handling long-tailed distributions in medical imaging efficiently and accessibly.
Abstract
In medical contexts, the imbalanced data distribution of long-tailed datasets, caused by scarce labels for rare diseases, greatly impairs the diagnostic accuracy of deep learning models. Recent multimodal foundation models trained with text-image supervision offer new solutions to data scarcity through effective representation learning. However, their limited medical-specific pretraining makes them perform worse on medical image classification than on natural images. To address this issue, we propose a novel Text-guided Foundation model Adaptation for Long-Tailed medical image classification (TFA-LT). We adopt a two-stage training strategy that integrates representations from the foundation model using just two linear adapters and a single ensembler to obtain balanced outcomes. Experimental results on two long-tailed medical image datasets validate the simplicity, lightweight design, and efficiency of our approach: while requiring only 6.1% of the GPU memory used by the current best-performing algorithm, our method achieves an accuracy improvement of up to 27.1%, highlighting the substantial potential of foundation model adaptation in this area.
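The pipeline described above (linear adapters with a residual connection, text-guided scoring by cosine similarity, and an ensembler that fuses two adapter outputs) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mixing ratio `alpha`, the logit `scale`, and the fixed fusion weight `w` are all assumptions made for illustration, and the random vectors stand in for frozen foundation-model image and text features. In practice the adapter weights and the ensembler would be learned over the two training stages.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def residual_adapter(x, W, alpha=0.5):
    """Linear adapter with a residual connection (alpha is an assumed mixing ratio)."""
    h = matvec(W, x)
    return [alpha * hi + (1.0 - alpha) * xi for hi, xi in zip(h, x)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    return dot / (na * nb + 1e-12)


def text_guided_logits(feat, text_feats, scale=100.0):
    """One logit per class: scaled cosine similarity to that class's text feature."""
    return [scale * cosine(feat, t) for t in text_feats]


def ensemble_logits(logits_a, logits_b, w=0.5):
    """Logit-level fusion of two adapters (w is fixed here; assumed learnable in practice)."""
    return [w * a + (1.0 - w) * b for a, b in zip(logits_a, logits_b)]


# Toy stand-ins for frozen foundation-model features (hypothetical sizes).
random.seed(0)
dim, n_classes = 4, 3
image_feat = [random.gauss(0, 1) for _ in range(dim)]
text_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_classes)]

# Two independently parameterized adapters, as in the two-adapter setup.
W_a = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
W_b = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]

logits_a = text_guided_logits(residual_adapter(image_feat, W_a), text_feats)
logits_b = text_guided_logits(residual_adapter(image_feat, W_b), text_feats)
logits = ensemble_logits(logits_a, logits_b)
predicted_class = max(range(n_classes), key=lambda c: logits[c])
```

Because only the small adapter matrices and the ensembler carry trainable parameters in a design like this, the foundation model itself can stay frozen, which is consistent with the low GPU memory footprint reported above.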
