GCI-ViTAL: Gradual Confidence Improvement with Vision Transformers for Active Learning on Label Noise

Moseli Mots'oehli, Kyungim Baek

TL;DR

This work proposes a novel deep AL algorithm, Gradual Confidence Improvement with Vision Transformers for Active Learning (GCI-ViTAL), designed to be robust to label noise, and shows that using ViTs leads to better performance over CNNs across all AL strategies, particularly in noisy label settings.

Abstract

Active learning (AL) aims to train accurate classifiers while minimizing labeling costs by strategically selecting informative samples for annotation. This study focuses on image classification, comparing AL methods on the CIFAR10, CIFAR100, Food101, and Chest X-ray datasets under varying label noise rates. We investigate the impact of model architecture by comparing Convolutional Neural Networks (CNNs) and Vision Transformer (ViT)-based models. Additionally, we propose a novel deep active learning algorithm, GCI-ViTAL, designed to be robust to label noise. GCI-ViTAL combines prediction entropy with the Frobenius norm of the difference between a sample's last-layer attention vectors and class-centric attention vectors computed on a clean set. The method thereby identifies samples that are both uncertain and semantically divergent from typical images of their assigned class. This allows GCI-ViTAL to select informative data points even in the presence of label noise while flagging potentially mislabeled candidates. Label smoothing is applied so that the trained model is not overly confident about potentially noisy labels. We evaluate GCI-ViTAL under varying levels of symmetric label noise and compare it to five other AL strategies. Our results demonstrate that using ViTs leads to significant performance improvements over CNNs across all AL strategies, particularly in noisy label settings. We also find that using the semantic information of images as label grounding helps in training a more robust model under label noise. Notably, we do not perform extensive hyperparameter tuning. The result is an out-of-the-box comparison for practitioners who must select models and active learning strategies without first conducting an exhaustive literature review on training and fine-tuning vision models on real-world application data.
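The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the weighted-sum form, the weight `alpha`, the use of the model's predicted class as the "assigned" class, and the function names `prediction_entropy`, `frobenius_distance`, `score_sample`, and `select_queries` are all assumptions made for illustration. Attention vectors are represented as plain nested lists.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of a single softmax output."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def frobenius_distance(a, b):
    """Frobenius norm of the element-wise difference of two equal-shape matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


def score_sample(probs, attention, class_centroids, alpha=0.5):
    """Combine uncertainty with divergence from a class centroid.

    Assumptions of this sketch: the score is a weighted sum with weight
    `alpha`, and the centroid used is that of the predicted class.
    """
    predicted = max(range(len(probs)), key=lambda k: probs[k])
    # Divide by log(K) so the entropy term lies in [0, 1].
    entropy = prediction_entropy(probs) / math.log(len(probs))
    divergence = frobenius_distance(attention, class_centroids[predicted])
    return alpha * entropy + (1.0 - alpha) * divergence


def select_queries(pool, class_centroids, budget, alpha=0.5):
    """Return indices of the `budget` highest-scoring unlabeled samples.

    `pool` is a list of (softmax_probs, attention_matrix) pairs.
    """
    scored = [(score_sample(p, a, class_centroids, alpha), i)
              for i, (p, a) in enumerate(pool)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:budget]]
```

In this sketch a sample is prioritized when the model is unsure about it and its attention pattern sits far from the clean-set pattern of its class, which matches the abstract's description of "uncertain and semantically divergent" samples. The divergence term is left unnormalized, so in practice the two terms would likely need rescaling before being combined.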

Paper Structure

This paper contains 21 sections, 13 equations, 17 figures, 7 tables, 1 algorithm.

Figures (17)

  • Figure 1: The main components in the AL framework in the presence of a noisy oracle. Each of these components may vary depending on the complexity of the data to be learned and the available resources. Most work in active learning with label noise has focused on the development of query selection algorithms that lead to highly informative and diverse data samples as well as noise-robust DL models.
  • Figure 2: ViT architecture showing how image patches are extracted and combined with their positional embeddings. The transformer encoder can contain multiple attention and normalization layers. Finally, fully connected layers are added, with a softmax operation for image classification. (Adapted from Kolesnikov:ViT21)
  • Figure 3: The transformer encoder block that constitutes the main component of representation learning in both large language models (LLMs) and ViTs. The main transformer components are the Multi-Head Self-Attention encoder blocks and the Multi-Layer Perceptron (MLP) layers. The normalization layer helps ensure the model is robust to covariate shifts in the features within a batch. (Adapted from Kolesnikov:ViT21)
  • Figure 4: The first stage in the AL cycle where the ViT model is fine-tuned on a clean random set and the C-Core attention vectors are computed for each cluster in the clean set. Once the initial training is done, the model and C-Core vectors are iteratively used in selecting samples for labeling.
  • Figure 5: The GCI-ViTAL DAL framework. This diagram shows the iterative active learning cycle, where C-Core attention vectors from the ViT model guide the selection of semantically challenging samples based on their distance from class centroids. Label smoothing mitigates noise, enhancing model robustness. Steps 6a to 11 continue until the labeling budget is exhausted.
  • ...and 12 more figures
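
Two ingredients named in the captions for Figures 4 and 5, class-centric attention vectors computed on a clean set and label smoothing, can be sketched as follows. This is a hedged illustration, not the paper's procedure: taking the per-class mean as the centroid, the smoothing value `epsilon=0.1`, and the function names `class_centroids` and `smooth_label` are assumptions. The smoothing formula is the standard one, (1 - epsilon) * one_hot + epsilon / K.

```python
def class_centroids(attention_vectors, labels, num_classes):
    """Mean attention vector per class over a clean labeled set.

    Using the mean as the class centroid is an assumption of this sketch.
    Classes with no clean samples get None.
    """
    dim = len(attention_vectors[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for vec, label in zip(attention_vectors, labels):
        counts[label] += 1
        for j, value in enumerate(vec):
            sums[label][j] += value
    return [[s / counts[k] for s in sums[k]] if counts[k] else None
            for k in range(num_classes)]


def smooth_label(label, num_classes, epsilon=0.1):
    """Standard label smoothing: (1 - epsilon) * one_hot + epsilon / K."""
    off_value = epsilon / num_classes
    return [off_value + (1.0 - epsilon) if k == label else off_value
            for k in range(num_classes)]
```

With smoothed targets, a possibly mislabeled sample pulls the model less strongly toward its (noisy) label than a hard one-hot target would, which is the motivation the abstract gives for using label smoothing.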