Curriculum Fine-tuning of Vision Foundation Model for Medical Image Classification Under Label Noise
Yeonguk Yu, Minhwan Ko, Sungho Shin, Kangmin Kim, Kyoobin Lee
TL;DR
CUFIT tackles noisy labels in medical image classification by exploiting pre-trained vision foundation models (VFMs) through a three-module curriculum: a linear probing module (LPM) trained on all data, an intermediate adapter module (IAM) fine-tuned on samples selected by the LPM, and a last adapter module (LAM) fine-tuned on samples selected by the IAM. Sample selection uses a simple agreement-based mechanism, the three modules are trained simultaneously, and inference is based on the final LAM. Across multiple medical datasets, CUFIT consistently improves robustness to noisy labels over strong baselines, including full fine-tuning, linear probing, and co-training methods, and it also shows gains on natural-image benchmarks. The work highlights the value of combining the stability of linear probing with targeted adapter updates to maximize clean-sample utilization while preserving rich VFM representations. Overall, CUFIT offers a practical and effective paradigm for robust medical image classification under label noise, with potential for broader application to other specialized vision tasks.
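The selection chain described above can be illustrated with a minimal sketch. It assumes that "agreement" means a module's predicted class equals the given (possibly noisy) label, and that each selection is computed over the full batch; neither detail is spelled out in this summary. The function names `select_by_agreement` and `curriculum_subsets` are hypothetical and chosen only for illustration.

```python
def select_by_agreement(predicted_labels, given_labels):
    """Return indices where the prediction matches the given (possibly noisy) label.

    Assumption: agreement is an exact match between predicted and given class.
    """
    return [
        i
        for i, (pred, given) in enumerate(zip(predicted_labels, given_labels))
        if pred == given
    ]


def curriculum_subsets(lpm_preds, iam_preds, given_labels):
    """Return the training index sets for the three modules (hypothetical helper).

    LPM: all samples.
    IAM: samples on which the LPM agrees with the given label.
    LAM: samples on which the IAM agrees with the given label.
    """
    lpm_train = list(range(len(given_labels)))
    iam_train = select_by_agreement(lpm_preds, given_labels)
    lam_train = select_by_agreement(iam_preds, given_labels)
    return lpm_train, iam_train, lam_train


# Toy batch of five samples with made-up predictions.
given = [0, 1, 2, 1, 0]
lpm_preds = [0, 1, 1, 1, 2]
iam_preds = [0, 2, 2, 1, 0]

lpm_idx, iam_idx, lam_idx = curriculum_subsets(lpm_preds, iam_preds, given)
print(lpm_idx)  # [0, 1, 2, 3, 4]
print(iam_idx)  # [0, 1, 3]
print(lam_idx)  # [0, 2, 3, 4]
```

In an actual training loop, the three index sets would be recomputed for every mini-batch, since the modules are trained simultaneously and their predictions change over time.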
Abstract
Deep neural networks have demonstrated remarkable performance in various vision tasks, but their success heavily depends on the quality of the training data. Noisy labels are a critical issue in medical datasets and can significantly degrade model performance. Previous clean sample selection methods have not exploited the pre-trained features of vision foundation models (VFMs) and instead assume that training begins from scratch. In this paper, we propose CUFIT, a curriculum fine-tuning paradigm for VFMs in medical image classification under label noise. Our method is motivated by the fact that linear probing of VFMs is relatively unaffected by noisy samples: because it does not update the feature extractor of the VFM, it classifies the training samples robustly. Curriculum fine-tuning of two adapters is then conducted, starting with the clean samples selected in the linear probing phase. Our experimental results demonstrate that CUFIT outperforms previous methods across various medical image benchmarks. Specifically, our method surpasses previous baselines by 5.0%, 2.1%, 4.6%, and 5.8% at a 40% noise rate on the HAM10000, APTOS-2019, BloodMNIST, and OrganCMNIST datasets, respectively. Furthermore, we provide extensive analyses to demonstrate the impact of our method on noisy label detection. For instance, our method shows higher label precision and recall than previous approaches. Our work highlights the potential of leveraging VFMs in medical image classification under challenging conditions of noisy labels.
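The abstract does not define label precision and recall, so the sketch below assumes the definitions commonly used in the noisy-label literature: precision is the fraction of selected samples whose given label is actually correct, and recall is the fraction of correctly labeled samples that were selected. The function name `label_precision_recall` and the toy data are hypothetical.

```python
def label_precision_recall(selected, noisy_labels, true_labels):
    """Compute label precision and recall for a set of selected sample indices.

    Assumed definitions (not taken from the abstract):
      precision = |selected and clean| / |selected|
      recall    = |selected and clean| / |clean|
    where a sample is "clean" if its given label equals its true label.
    """
    selected = set(selected)
    clean = {
        i
        for i, (noisy, true) in enumerate(zip(noisy_labels, true_labels))
        if noisy == true
    }
    hits = len(selected & clean)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(clean) if clean else 0.0
    return precision, recall


# Toy example: sample 2 carries a wrong label; samples 0, 1, 2 were selected.
noisy = [0, 1, 2, 1, 0]
true = [0, 1, 1, 1, 0]
precision, recall = label_precision_recall([0, 1, 2], noisy, true)
print(round(precision, 3), recall)  # 0.667 0.5
```

Computing these metrics requires the true labels, so they can be measured only on benchmarks where noise is injected synthetically or where a verified clean annotation exists.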
