Active Prompt Learning in Vision Language Models

Jihwan Bang, Sumyeong Ahn, Jae-Gil Lee

TL;DR

This paper tackles the challenge of adapting pre-trained vision-language models through active learning, identifying that naive sampling amplifies class imbalance and degrades performance. To address this, the authors introduce PCB, a two-stage, balance-aware framework that uses pseudo-labels from the VLM to construct a balanced query set before querying experts, integrated with prompt learning. They further enhance robustness via description augmentation, generating per-class descriptions to support multiple text embeddings and two aggregation schemes (AS and AE). Extensive experiments across seven datasets demonstrate that PCB improves over standard active-learning baselines and random sampling, with gains up to about 4.6 percentage points, and often yields the best results when combined with BADGE; the approach thus provides a practical path for efficient task adaptation of VLMs.
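
The balance-aware idea in this summary can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `balanced_query`, its arguments, and the greedy "rarest pseudo-class first" rule are assumptions made for illustration. It assumes some active-learning sampler has already proposed candidate indices and that the VLM's zero-shot predictions are available as pseudo-labels.

```python
import numpy as np

def balanced_query(candidates, pseudo_labels, labeled_counts, budget):
    """Greedily pick up to `budget` candidates, favouring under-represented classes.

    candidates     : sample indices proposed by any active-learning sampler (assumed given)
    pseudo_labels  : pseudo_labels[i] is the VLM's zero-shot class guess for sample i
    labeled_counts : number of already-labeled samples per class
    """
    counts = np.asarray(labeled_counts, dtype=int).copy()
    remaining = list(candidates)
    chosen = []
    while remaining and len(chosen) < budget:
        # Take the candidate whose pseudo-class currently has the fewest labels.
        best = min(remaining, key=lambda i: counts[pseudo_labels[i]])
        chosen.append(best)
        remaining.remove(best)
        # The pseudo-label is used only for balancing; the expert supplies the true label later.
        counts[pseudo_labels[best]] += 1
    return chosen

# Six candidates, four of which the VLM guesses to be class 0.
print(balanced_query([0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 2], [0, 0, 0], budget=3))
# prints [0, 4, 5]: one candidate per pseudo-class rather than three from class 0
```

Because balancing happens before the expert is queried, the labeling budget is spent on a set that is already spread across classes, under the assumption that the pseudo-labels are reasonably accurate.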

Abstract

Pre-trained Vision Language Models (VLMs) have demonstrated notable progress in various zero-shot tasks, such as classification and retrieval. Despite this performance, adapting them is essential because improving performance on new tasks requires task-specific knowledge. Adaptation requires labels, but acquiring them is typically expensive. To overcome this challenge, active learning, which aims to achieve high performance by obtaining expert labels for only a small number of samples, has been studied. Active learning primarily focuses on selecting unlabeled samples for labeling and leveraging them to train models. In this study, we pose the question, "how can the pre-trained VLMs be adapted under the active learning framework?" In response to this inquiry, we observe that (1) simply applying a conventional active learning framework to pre-trained VLMs may even degrade performance compared to random selection because of the class imbalance in labeling candidates, and (2) the knowledge of VLMs can provide hints for achieving the balance before labeling. Based on these observations, we devise a novel active learning framework for VLMs, denoted as PCB. To assess the effectiveness of our approach, we conduct experiments on seven different real-world datasets, and the results demonstrate that PCB surpasses conventional active learning and random sampling methods. Code will be available at https://github.com/kaist-dmlab/pcb.

Paper Structure (19 sections, 9 equations, 7 figures, 7 tables, 2 algorithms)

Figures (7)

  • Figure 1: Key motivation and complete process behind active prompt learning. When we employ a traditional active learning framework to adapt prompt learning to a new target task, the active learning sampler incurs a significant class imbalance (indicated by red bars). This imbalance prevents the final performance from improving (as indicated by blue bars). In this paper, we introduce a novel algorithm named PCB that rectifies this imbalance by harnessing the knowledge of VLMs, enabling effective utilization of the oracle.
  • Figure 2: Learning curve. Average accuracy on downstream tasks with the ViT-B/32 image encoder for each round.
  • Figure 3: Imbalance curve. Average variance of the number of labeled samples for each class on downstream tasks with the ViT-B/32 image encoder for each round.
  • Figure 4: Accuracy and imbalance for various values of $\gamma$ on Flowers102 (top) and DTD (bottom).
  • Figure 5: CoOp case analysis of BADGE on Flowers102.
  • ...and 2 more figures
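
The imbalance measure named in the Figure 3 caption, the variance of the number of labeled samples per class, can be computed as in the sketch below. The function name `label_imbalance` is hypothetical, and the use of the population variance (NumPy's default) is an assumption, since the caption does not say which variant is used.

```python
import numpy as np

def label_imbalance(labels, num_classes):
    """Variance of per-class labeled-sample counts; 0.0 means perfectly balanced."""
    # Count labeled samples per class, including classes with no labels yet.
    counts = np.bincount(np.asarray(labels, dtype=int), minlength=num_classes)
    # Population variance (ddof=0) is an assumption made for this sketch.
    return float(np.var(counts))

print(label_imbalance([0, 1, 2], num_classes=3))     # balanced labeled set
print(label_imbalance([0, 0, 0, 1], num_classes=3))  # skewed toward class 0
```

Averaging this value over datasets for each round would give a curve of the kind the caption describes.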