Active Learning for Vision-Language Models

Bardia Safaei, Vishal M. Patel

TL;DR

This work proposes a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation during training.

Abstract

Pre-trained vision-language models (VLMs) like CLIP have demonstrated impressive zero-shot performance on a wide range of downstream computer vision tasks. However, there still exists a considerable performance gap between these models and a supervised deep model trained on a downstream dataset. To bridge this gap, we propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation during training. To achieve this, our approach first calibrates the predicted entropy of VLMs and then utilizes a combination of self-uncertainty and neighbor-aware uncertainty to calculate a reliable uncertainty measure for active sample selection. Our extensive experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets, and significantly enhances the zero-shot performance of VLMs.
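The uncertainty measure described above can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration and not the authors' implementation: the calibration step is stood in for by a plain softmax temperature, neighbors are found by cosine similarity between image features, and the names `uncertainty_scores`, `select_for_annotation`, `temperature`, `k`, and `alpha` are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def entropy(probs):
    # Shannon entropy per sample; the small constant avoids log(0).
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)


def uncertainty_scores(image_feats, text_feats, temperature=0.01, k=5, alpha=0.5):
    """Blend each sample's own entropy with the mean entropy of its neighbors.

    Assumptions (not taken from the paper): a softmax temperature replaces
    the calibration step, and `alpha` linearly mixes the two terms.
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)

    # Zero-shot class probabilities from image-text cosine similarities.
    probs = softmax(img @ txt.T / temperature)
    self_unc = entropy(probs)

    # k nearest neighbors in image-feature space, excluding the sample itself.
    sim = img @ img.T
    np.fill_diagonal(sim, -np.inf)
    nn_idx = np.argsort(-sim, axis=1)[:, :k]
    neighbor_unc = self_unc[nn_idx].mean(axis=1)

    return alpha * self_unc + (1.0 - alpha) * neighbor_unc


def select_for_annotation(scores, budget):
    # Indices of the `budget` most uncertain samples, most uncertain first.
    return np.argsort(-scores)[:budget]
```

In this sketch, `image_feats` and `text_feats` would be arrays of shape (N, D) and (C, D) produced by a VLM's image and text encoders, and `k` must be smaller than N. Picking the top-scoring samples directly ignores diversity; it is used here only to keep the example short.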

Paper Structure

This paper contains 15 sections, 7 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: VL models such as CLIP can effectively transfer to various downstream vision tasks. However, their performance on a novel target dataset still falls behind a supervised model specifically trained on the target dataset. Active learning approaches aim to reduce this performance gap by querying only a few beneficial samples from the unlabeled data, acquiring their labels, and efficiently utilizing them for improving the VLM performance.
  • Figure 2: An illustration of our CEC framework. For a given unlabeled dataset, the visual features are extracted, and the prediction probabilities are calculated using textual features. Next, we calibrate the entropy score and utilize it to precisely quantify the uncertainty of an unlabeled sample, considering its similarities to textual embeddings. Moreover, we regularize this entropy by incorporating the uncertainty of a sample's neighbors. Ultimately, this calibrated entropy is integrated into an uncertainty-weighted clustering approach to ensure diverse sample selection. The selected samples are then annotated and used for prompt tuning of CLIP.
  • Figure 3: Effect of each component within our framework.
  • Figure 4: Hyperparameter analysis.
  • Figure 5: CoOp accuracy results over 6 AL cycles. From left to right: Textures, Flowers102, and UCF101 datasets.
  • ...and 2 more figures