Text-guided Foundation Model Adaptation for Long-Tailed Medical Image Classification

Sirui Li, Li Lin, Yijin Huang, Pujin Cheng, Xiaoying Tang

TL;DR

This work tackles long-tailed medical image classification by adapting foundation-model representations through a two-stage, text-guided framework. It introduces a residual connection adapter for visual features and a two-adapter setup trained on re-balanced data, followed by a linear ensembler that fuses representations at feature or logit levels, guided by text prompts and cosine similarity to semantic text features. The approach achieves state-of-the-art or competitive results on two medical datasets (ISIC2018 and APTOS2019) while dramatically reducing GPU memory usage (about 6.1% of the memory of leading methods) and requiring only lightweight components. The method demonstrates the practical potential of text-guided foundation-model adaptation for handling long-tailed distributions in medical imaging with high efficiency and accessibility.

Abstract

In medical contexts, the imbalanced data distribution in long-tailed datasets, due to scarce labels for rare diseases, greatly impairs the diagnostic accuracy of deep learning models. Recent multimodal text-image supervised foundation models offer new solutions to data scarcity through effective representation learning. However, their limited medical-specific pretraining hinders their performance in medical image classification relative to natural images. To address this issue, we propose a novel Text-guided Foundation model Adaptation for Long-Tailed medical image classification (TFA-LT). We adopt a two-stage training strategy, integrating representations from the foundation model using just two linear adapters and a single ensembler for balanced outcomes. Experimental results on two long-tailed medical image datasets validate the simplicity, lightweight design, and efficiency of our approach: while requiring only 6.1% of the GPU memory used by the current best-performing algorithm, our method achieves an accuracy improvement of up to 27.1%, highlighting the substantial potential of foundation model adaptation in this area.
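
The pipeline summarized above (linear adapters with a residual connection, class scores from cosine similarity to text features, and an ensembler that fuses two branches) can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the blending form of the residual connection, the `ratio`, `scale`, and `weight_a` parameters, the fixed convex combination standing in for the learned linear ensembler, and all function names are assumptions made for illustration. A real implementation would use a tensor library and features from a frozen foundation model.

```python
import math


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def residual_adapter(x, weight, bias, ratio=0.5):
    """Blend the adapted feature with the frozen feature.

    The blend ratio * adapter(x) + (1 - ratio) * x is an assumed residual form.
    """
    adapted = linear(x, weight, bias)
    return [ratio * a + (1.0 - ratio) * v for a, v in zip(adapted, x)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def text_guided_logits(feature, text_features, scale=1.0):
    """One logit per class: scaled cosine similarity to that class's text feature."""
    return [scale * cosine(feature, t) for t in text_features]


def ensemble_logits(logits_a, logits_b, weight_a=0.5):
    """Logit-level fusion as a convex combination (stand-in for a learned ensembler)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(logits_a, logits_b)]


# Toy example: 2-dimensional features, 2 classes, hand-picked adapter weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
zero_bias = [0.0, 0.0]
image_feature = [1.0, 0.0]                # hypothetical frozen image feature
text_features = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical per-class text features

feat_a = residual_adapter(image_feature, identity, zero_bias)
feat_b = residual_adapter(image_feature, swap, zero_bias)
logits = ensemble_logits(
    text_guided_logits(feat_a, text_features),
    text_guided_logits(feat_b, text_features),
)
predicted_class = max(range(len(logits)), key=logits.__getitem__)
```

In this toy setup the two adapters play the role of the two branches trained on differently balanced data, and only the adapter and ensembler parameters would be trainable, which is consistent with the low GPU memory footprint the abstract reports.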

Paper Structure (13 sections, 3 equations, 3 figures, 2 tables)

This paper contains 13 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Class distributions and subset divisions of two long-tailed medical datasets employed in our experiments.
  • Figure 2: The architecture of our two-stage framework: the upper section depicts stage I and the design of our residual connection adapter, while the lower section outlines stage II and the two levels of representation ensemble.
  • Figure 3: Comparison of GPU usage during training between TFA-LT's two stages (prefixed with *) and the 9 other benchmark methods.