Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels

Maria Marrium, Arif Mahmood, Mohammed Bennamoun

TL;DR

A benchmark of ViT fine-tuning under noisy labels shows that adding entropy regularization improves the performance of established loss functions such as CE and FL, as well as the robustness of the six studied NLL methods, across both ViT backbones.

Abstract

Automatic annotation of large-scale datasets can introduce noisy training data labels, which adversely affect the learning process of deep neural networks (DNNs). Consequently, Noisy Labels Learning (NLL) has become a critical research field for Convolutional Neural Networks (CNNs), though it remains less explored for Vision Transformers (ViTs). In this study, we evaluate the vulnerability of ViT fine-tuning to noisy labels and compare its robustness with CNNs. We also investigate whether NLL methods developed for CNNs are equally effective for ViTs. Using linear probing and MLP-K fine-tuning, we benchmark two ViT backbones (ViT-B/16 and ViT-L/16) using three commonly used classification losses: Cross Entropy (CE), Focal Loss (FL), and Mean Absolute Error (MAE), alongside six robust NLL methods: GCE, SCE, NLNL, APL, NCE+AGCE, and ANL-CE. The evaluation is conducted across six datasets including MNIST, CIFAR-10/100, WebVision, Clothing1M, and Food-101N. Furthermore, we explore whether implicit prediction entropy minimization contributes to ViT robustness against noisy labels, noting a general trend of prediction entropy reduction across most NLL methods. Building on this observation, we examine whether explicit entropy minimization could enhance ViT resilience to noisy labels. Our findings indicate that incorporating entropy regularization enhances the performance of established loss functions such as CE and FL, as well as the robustness of the six studied NLL methods across both ViT backbones.
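
To make the idea of explicit entropy minimization concrete, the following is a minimal sketch of a cross-entropy loss with an added prediction-entropy term, written as CE + λ·H. It is not the authors' implementation: the function names, the per-sample formulation, and the `eps` smoothing constant are assumptions made for illustration, and only the general form CE + λH is taken from the paper's figure captions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label, eps=1e-12):
    """Standard CE for one sample: -log p[label]. eps is an assumed smoothing term."""
    return -math.log(probs[label] + eps)

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy H(p) = -sum_k p_k log p_k of the predicted distribution."""
    return -sum(p * math.log(p + eps) for p in probs)

def entropy_regularized_ce(logits, label, lam):
    """Hypothetical helper: CE + lam * H(p). With lam > 0, minimizing this
    loss also pushes the prediction entropy down (explicit entropy minimization)."""
    probs = softmax(logits)
    return cross_entropy(probs, label) + lam * prediction_entropy(probs)
```

Setting `lam=0` recovers plain CE, so the regularizer can be switched on for any base loss without changing the rest of the training loop.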

Paper Structure (24 sections, 3 equations, 5 figures, 20 tables)

Figures (5)

  • Figure 1: Prediction Entropy and Validation Accuracy Trends during Training on Noisy CIFAR-100 Data using ViT-B/16 with MLP-3 Fine-Tuning. Illustration of the changes in prediction entropy and the corresponding validation accuracy over 100 epochs for various classification loss functions: Cross-Entropy (CE), Focal Loss (FL) [lin2017focal], NCE+AGCE [zhou2021asymmetric], and ANL-CE [ye2023active]. The graph shows that as prediction entropy decreases, there is a marked improvement in validation accuracy, indicating effective learning and adaptation to noisy data conditions.
  • Figure 2: Comparative Diagram of Five Fine-Tuning Techniques for Vision Transformers. Details of the architectural modifications in ViTs when employing different fine-tuning strategies: Full Fine-Tuning, AdaptFormer [chen2022adaptformer], Visual Prompt Tuning [jia2022visual], MLP-K, and Linear Probing [he2020momentum]. Each diagram shows which components of the architecture are tunable (green) versus frozen (pink) during the fine-tuning process. Specific elements such as input tokens and prompts are indicated, providing insights into how each technique modifies the standard ViT architecture to adapt to training constraints and objectives.
  • Figure 3: Impact of Noise Rates on Test Accuracy and Computational Overhead for Various Fine-Tuning Techniques. (a) illustrates the test accuracy of five fine-tuning methods (Full Fine-Tuning, AdaptFormer [chen2022adaptformer], Visual Prompt Tuning [jia2022visual], MLP-3, and Linear Probing [he2020momentum]) on the CIFAR-10 dataset under increasing symmetric noise rates from 0.2 to 0.8. (b) similarly depicts test accuracy as asymmetric noise levels increase from 0.2 to 0.4, demonstrating how each method copes with noise imbalance. (c) compares the computational overhead by showing the training time and the number of learnable parameters across these fine-tuning techniques, highlighting differences in computational efficiency and resource demands.
  • Figure 4: Robustness comparison between CNNs and Vision Transformers (ViTs) across different noise types and levels on CIFAR-10 and CIFAR-100 datasets. The misclassification error is plotted against increasing noise rates for both symmetric and asymmetric noise. Results indicate that ViTs exhibit greater robustness to noisy training labels compared to CNNs, particularly as the noise rate increases. Performance is measured using the cross-entropy (CE) loss function across all model backbones.
  • Figure 5: Impact of varying $\lambda_l$ on test accuracy for CIFAR-10 using CE+$\lambda_lH_l$ with ViT-B/16+MLP-3 under (a) symmetric noise and (b) asymmetric noise. The linear scheduling of $\lambda_l$ (Linear(0$\rightarrow$0.3)) achieves the best performance across both noise types.
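
The captions above refer to symmetric noise rates and to a Linear(0$\rightarrow$0.3) schedule for $\lambda_l$. The sketch below shows one plausible reading of both, under stated assumptions: that the schedule interpolates $\lambda_l$ linearly per epoch from 0 to 0.3, and that symmetric noise flips a label, with probability equal to the noise rate, to a different class chosen uniformly at random. The paper's exact definitions may differ, and the names `linear_lambda` and `add_symmetric_noise` are hypothetical.

```python
import random

def linear_lambda(epoch, total_epochs, lam_start=0.0, lam_end=0.3):
    """Assumed per-epoch linear schedule: lam_start at epoch 0,
    lam_end at the final epoch (index total_epochs - 1)."""
    if total_epochs <= 1:
        return lam_end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return lam_start + frac * (lam_end - lam_start)

def add_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Assumed symmetric-noise model: each label is replaced, with
    probability noise_rate, by a different class drawn uniformly."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            # Exclude the true class so a flip always changes the label.
            choices = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(choices))
        else:
            noisy.append(y)
    return noisy
```

In a training loop, `linear_lambda(epoch, total_epochs)` would supply the weight on the entropy term at each epoch, so the regularizer starts inactive and strengthens as training progresses.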