Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels

Maria Marrium; Arif Mahmood; Mohammed Bennamoun

TL;DR

A benchmark of ViT fine-tuning under noisy labels shows that adding entropy regularization improves the performance of established loss functions such as CE and FL, as well as the robustness of the six studied NLL methods, across both ViT backbones.

Abstract

Automatic annotation of large-scale datasets can introduce noisy training data labels, which adversely affect the learning process of deep neural networks (DNNs). Consequently, Noisy Labels Learning (NLL) has become a critical research field for Convolutional Neural Networks (CNNs), though it remains less explored for Vision Transformers (ViTs). In this study, we evaluate the vulnerability of ViT fine-tuning to noisy labels and compare its robustness with CNNs. We also investigate whether NLL methods developed for CNNs are equally effective for ViTs. Using linear probing and MLP-K fine-tuning, we benchmark two ViT backbones (ViT-B/16 and ViT-L/16) using three commonly used classification losses: Cross Entropy (CE), Focal Loss (FL), and Mean Absolute Error (MAE), alongside six robust NLL methods: GCE, SCE, NLNL, APL, NCE+AGCE, and ANL-CE. The evaluation is conducted across six datasets including MNIST, CIFAR-10/100, WebVision, Clothing1M, and Food-101N. Furthermore, we explore whether implicit prediction entropy minimization contributes to ViT robustness against noisy labels, noting a general trend of prediction entropy reduction across most NLL methods. Building on this observation, we examine whether explicit entropy minimization could enhance ViT resilience to noisy labels. Our findings indicate that incorporating entropy regularization enhances the performance of established loss functions such as CE and FL, as well as the robustness of the six studied NLL methods across both ViT backbones.
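The explicit entropy regularization described in the abstract can be sketched for a single sample as follows. This is a minimal illustration, not the authors' implementation: it assumes the regularizer is the Shannon entropy of the softmax prediction, added to cross-entropy with a weight `lam` (the Figure 5 caption writes this combination as CE+$\lambda_lH_l$). The function names and the `eps` constant are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label, eps=1e-12):
    """Cross-entropy of a predicted distribution against an integer label."""
    return -math.log(probs[label] + eps)

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy of the predicted distribution (assumed regularizer)."""
    return -sum(p * math.log(p + eps) for p in probs)

def entropy_regularized_ce(logits, label, lam):
    """CE plus lam times prediction entropy; lam = 0 recovers plain CE."""
    probs = softmax(logits)
    return cross_entropy(probs, label) + lam * prediction_entropy(probs)
```

Minimizing the added term pushes predictions toward low-entropy (confident) distributions, which is the behaviour the abstract reports as an implicit trend across most NLL methods.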

Paper Structure (24 sections, 3 equations, 5 figures, 20 tables)

Introduction
Related Work
Deep Learning-based NLL Methods
ViT Fine-tuning Techniques
Proposed Methodology
Problem Formulation
Label Noise Generation
Entropy Regularization as a Robust Loss Function
Motivation for Entropy Regularization
Explicit Entropy Regularization
Experiments and Results
Vulnerability of ViT Fine-Tuning to Noisy Labels
Robustness Comparison of CNNs Vs. ViTs
Effectiveness of Existing NLL methods for ViTs
Implicit Entropy Minimization Relation with Performance
...and 9 more sections

Figures (5)

Figure 1: Prediction Entropy and Validation Accuracy Trends during Training on Noisy CIFAR-100 Data using ViT-B/16 with MLP-3 Fine-Tuning. Illustration of the changes in prediction entropy and the corresponding validation accuracy over 100 epochs for various classification loss functions: Cross-Entropy (CE), Focal Loss (FL) lin2017focal, NCE+AGCE zhou2021asymmetric, and ANL-CE ye2023active. The graph shows that as prediction entropy decreases, there is a marked improvement in validation accuracy, indicating effective learning and adaptation to noisy data conditions.
Figure 2: Comparative Diagram of Five Fine-Tuning Techniques for Vision Transformers. Details of the architectural modifications in ViTs when employing different fine-tuning strategies: Full Fine-Tuning, AdaptFormer chen2022adaptformer, Visual Prompt Tuning jia2022visual, MLP-K, and Linear Probing he2020momentum. Each diagram shows which components of the architecture are tunable (green) versus frozen (pink) during the fine-tuning process. Specific elements such as input tokens and prompts are indicated, providing insights into how each technique modifies the standard ViT architecture to adapt to training constraints and objectives.
Figure 3: Impact of Noise Rates on Test Accuracy and Computational Overhead for Various Fine-Tuning Techniques. (a) Test accuracy of five fine-tuning methods (Full Fine-Tuning, AdaptFormer chen2022adaptformer, Visual Prompt Tuning jia2022visual, MLP-3, and Linear Probing he2020momentum) on the CIFAR-10 dataset as the symmetric noise rate increases from 0.2 to 0.8. (b) Test accuracy as the asymmetric noise rate increases from 0.2 to 0.4, showing how each method copes with asymmetric noise. (c) Computational overhead of the same fine-tuning techniques, measured by training time and number of learnable parameters.
Figure 4: Robustness comparison between CNNs and Vision Transformers (ViTs) across different noise types and levels on CIFAR-10 and CIFAR-100 datasets. The misclassification error is plotted against increasing noise rates for both symmetric and asymmetric noise. Results indicate that ViTs exhibit greater robustness to noisy training labels compared to CNNs, particularly as the noise rate increases. Performance is measured using the cross-entropy (CE) loss function across all model backbones.
Figure 5: Impact of varying $\lambda_l$ on test accuracy for CIFAR-10 using CE+$\lambda_lH_l$ with ViT-B/16+MLP-3 under (a) symmetric noise and (b) asymmetric noise. The linear scheduling of $\lambda_l$ (Linear(0$\rightarrow$0.3)) achieves the best performance across both noise types.
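The Linear(0$\rightarrow$0.3) schedule named in the Figure 5 caption could look like the sketch below. The caption gives only the endpoints, so the per-epoch granularity, the function name `linear_lambda`, and its parameters are assumptions made for illustration.

```python
def linear_lambda(epoch, total_epochs, start=0.0, end=0.3):
    """Linearly interpolate the entropy weight from start to end.

    Assumes epochs are indexed 0 .. total_epochs - 1 and that the weight
    is updated once per epoch (the caption does not specify this).
    """
    if total_epochs <= 1:
        return end
    # Clamp so out-of-range epochs stay within [start, end].
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + frac * (end - start)
```

Under these assumptions the entropy term is absent at the start of training and reaches its full weight in the final epoch.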
