Curriculum Fine-tuning of Vision Foundation Model for Medical Image Classification Under Label Noise

Yeonguk Yu, Minhwan Ko, Sungho Shin, Kangmin Kim, Kyoobin Lee

TL;DR

CUFIT tackles the problem of noisy labels in medical image classification by exploiting pre-trained vision foundation models through a three-module curriculum: linear probing (LPM) on all data, intermediate adapter fine-tuning (IAM) on samples selected by LPM, and last adapter fine-tuning (LAM) on samples selected by IAM. The method uses a simple agreement-based sample-selection mechanism and trains the three modules simultaneously, with inference based on the final LAM. Empirical results across multiple medical datasets show that CUFIT consistently improves robustness to noisy labels over strong baselines, including full fine-tuning, linear probing, and co-training methods, and exhibits gains on natural-image benchmarks as well. The work highlights the value of combining the stability of linear probing with targeted adapter updates to maximize clean-sample utilization while preserving rich VFM representations. Overall, CUFIT offers a practical and effective paradigm for robust medical image classification under label noise, with potential for broader application to other specialized vision tasks.
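The agreement-based selection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that "agreement" means a module's argmax prediction matches the given (possibly noisy) label, and the function names `select_by_agreement` and `curriculum_masks` are hypothetical. Whether the IAM's check runs on the whole batch or only on the LPM-selected subset is not stated in this summary; the sketch applies both checks to the full batch.

```python
import numpy as np

def select_by_agreement(logits, given_labels):
    """Boolean mask: True where the predicted class equals the given label.

    Assumption: 'agreement' is argmax(logits) == given label.
    """
    preds = np.argmax(logits, axis=1)
    return preds == np.asarray(given_labels)

def curriculum_masks(lpm_logits, iam_logits, given_labels):
    """Hypothetical helper returning the per-batch training masks.

    - LPM trains on every sample (no mask needed).
    - IAM trains on samples where the LPM agrees with the label.
    - LAM trains on samples where the IAM agrees with the label.
    """
    mask_for_iam = select_by_agreement(lpm_logits, given_labels)
    mask_for_lam = select_by_agreement(iam_logits, given_labels)
    return mask_for_iam, mask_for_lam

# Toy batch: 4 samples, 2 classes; the third label is treated as noisy.
labels = np.array([0, 1, 1, 1])
lpm_logits = np.array([[2.0, 0.1], [0.2, 1.5], [0.9, 0.3], [0.1, 0.8]])
iam_logits = np.array([[3.0, 0.0], [0.0, 3.0], [0.0, 3.0], [3.0, 0.0]])
mask_for_iam, mask_for_lam = curriculum_masks(lpm_logits, iam_logits, labels)
```

In a training loop, each mask would typically weight or filter the per-sample loss of the corresponding module, while all three modules are updated in the same step.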

Abstract

Deep neural networks have demonstrated remarkable performance in various vision tasks, but their success heavily depends on the quality of the training data. Noisy labels are a critical issue in medical datasets and can significantly degrade model performance. Previous clean sample selection methods have not utilized the well pre-trained features of vision foundation models (VFMs) and assumed that training begins from scratch. In this paper, we propose CUFIT, a curriculum fine-tuning paradigm of VFMs for medical image classification under label noise. Our method is motivated by the fact that linear probing of VFMs is relatively unaffected by noisy samples, as it does not update the feature extractor of the VFM, thus robustly classifying the training samples. Subsequently, curriculum fine-tuning of two adapters is conducted, starting with clean sample selection from the linear probing phase. Our experimental results demonstrate that CUFIT outperforms previous methods across various medical image benchmarks. Specifically, our method surpasses previous baselines by 5.0%, 2.1%, 4.6%, and 5.8% at a 40% noise rate on the HAM10000, APTOS-2019, BloodMNIST, and OrganCMNIST datasets, respectively. Furthermore, we provide extensive analyses to demonstrate the impact of our method on noisy label detection. For instance, our method shows higher label precision and recall compared to previous approaches. Our work highlights the potential of leveraging VFMs in medical image classification under challenging conditions of noisy labels.
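To make the "40% noise rate" setting concrete, the following sketch shows one common way to simulate label noise on a clean dataset. It is an illustration under stated assumptions: the abstract does not say which noise model is used, so symmetric (uniform) flipping to a different class is assumed here, and the function name `inject_symmetric_noise` and its parameters are hypothetical.

```python
import numpy as np

def inject_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Flip a fixed fraction of labels to a different, uniformly chosen class.

    Assumption: symmetric noise; the noise model of the paper is not
    specified in this summary.
    Returns the noisy labels and the indices that were flipped.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(noise_rate * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] guarantees the new label differs.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    return noisy, flip_idx

# Illustrative usage: 1000 samples, 7 classes, 40% of labels corrupted.
clean = np.arange(1000) % 7
noisy, flipped = inject_symmetric_noise(clean, noise_rate=0.4, num_classes=7)
```

Keeping the flipped indices makes it possible to evaluate later how well a selection method separates clean from corrupted samples.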

Paper Structure

This paper contains 21 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of linear probing (a) and adapter usage (b). Specifically, the weights of the foundation model are frozen, while the fully connected layer or adapter weights (shown in orange) are updated during the training phase. In (c), a performance comparison using a simulated noisy dataset (HAM10000) is presented. It demonstrates that linear probing is more robust to noisy labels compared to the adapter, whereas the adapter outperforms linear probing when there are no noisy labels.
  • Figure 2: Illustration of our proposed training framework, CUFIT, which consists of a pre-trained VFM and three distinct modules: (a) the linear probing module (LPM), (b) the intermediate adapter module (IAM), and (c) the last adapter module (LAM). During the training stage, the LPM selects clean samples for the IAM based on the agreement criterion, and the IAM selects clean samples for the LAM. During the inference stage, only the LAM is used for prediction.
  • Figure 3: Illustration of label precision (a,d), label recall (b,e), and test accuracy (c,f) vs. epoch. The first row is for HAM10000 with 40% noise rate, and the second row is for APTOS-2019 with 40% noise rate.
  • Figure 4: Test accuracy of our method with various VFMs (DINOv1 [caron2021emerging], MAE [he2022masked], DINOv2 [oquab2023dinov2]) and adapters (VPT [jia2022visual], AdaptFormer [chen2022adaptformer], Rein [wei2023stronger]). We use HAM10000 and APTOS-2019 with 40% noise rate for training.
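
The label precision and label recall plotted in Figure 3 can be computed along the following lines when the ground-truth clean/noisy status of each training sample is known, as in simulated-noise experiments. This is a sketch that assumes the usual definitions from the sample-selection literature (precision: the fraction of selected samples whose labels are actually clean; recall: the fraction of clean samples that get selected); the function name `label_precision_recall` is hypothetical.

```python
import numpy as np

def label_precision_recall(selected_mask, is_clean):
    """Label precision and recall of a clean-sample selection.

    Assumed definitions:
      precision = |selected AND clean| / |selected|
      recall    = |selected AND clean| / |clean|
    """
    selected_mask = np.asarray(selected_mask, dtype=bool)
    is_clean = np.asarray(is_clean, dtype=bool)
    true_pos = np.sum(selected_mask & is_clean)
    # max(..., 1) avoids division by zero when nothing is selected or clean.
    precision = true_pos / max(selected_mask.sum(), 1)
    recall = true_pos / max(is_clean.sum(), 1)
    return float(precision), float(recall)

# Toy example: 3 samples selected, 2 of which are truly clean.
selected = [True, True, True, False]
clean = [True, True, False, False]
precision, recall = label_precision_recall(selected, clean)
```

Tracking both values per epoch, as the figure does, shows whether a method keeps noisy samples out (precision) without discarding too many clean ones (recall).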