Active Prompt Learning with Vision-Language Model Priors

Hoyoung Kim, Seokhee Jin, Changhwan Sung, Jaechang Kim, Jungseul Ok

TL;DR

This work introduces a class-guided clustering that leverages the pre-trained image and text encoders of VLMs, enabling a cluster-balanced acquisition function from the initial round of active learning, and proposes budget-saving selective querying based on adaptive class-wise thresholds.

Abstract

Vision-language models (VLMs) have demonstrated remarkable zero-shot performance across various classification tasks. Nonetheless, their reliance on hand-crafted text prompts for each task hinders efficient adaptation to new tasks. While prompt learning offers a promising solution, most studies focus on maximizing the utilization of given few-shot labeled datasets, often overlooking the potential of careful data selection strategies, which enable higher accuracy with fewer labeled data. This motivates us to study a budget-efficient active prompt learning framework. Specifically, we introduce a class-guided clustering that leverages the pre-trained image and text encoders of VLMs, thereby enabling our cluster-balanced acquisition function from the initial round of active learning. Furthermore, considering the substantial class-wise variance in confidence exhibited by VLMs, we propose a budget-saving selective querying based on adaptive class-wise thresholds. Extensive experiments in active learning scenarios across nine datasets demonstrate that our method outperforms existing baselines.
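
The selective querying idea can be illustrated with a minimal sketch. The abstract only states that the thresholds are adaptive and class-wise; the specific rule used here (the mean confidence of previously labeled samples predicted as each class) and the names `class_wise_thresholds`, `selective_query`, and `default` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def class_wise_thresholds(labeled_conf, labeled_pred, num_classes, default=1.0):
    """Build one confidence threshold per class.

    Assumed rule: the threshold of class c is the mean confidence of the
    previously labeled samples that the model predicted as c. Classes with
    no labeled samples get `default`, so nothing is pseudo-labeled for them.
    """
    thresholds = np.full(num_classes, default, dtype=float)
    for c in range(num_classes):
        mask = labeled_pred == c
        if mask.any():
            thresholds[c] = labeled_conf[mask].mean()
    return thresholds

def selective_query(cand_probs, thresholds):
    """Split candidates into pseudo-labeled and annotator-labeled.

    Returns the predicted class of each candidate and a boolean mask that is
    True where the confidence exceeds the threshold of the predicted class
    (pseudo-label, no budget spent) and False where an annotator is queried.
    """
    pred = cand_probs.argmax(axis=1)
    conf = cand_probs.max(axis=1)
    pseudo_mask = conf > thresholds[pred]
    return pred, pseudo_mask

# Toy usage: three labeled samples, four new candidates, three classes.
labeled_conf = np.array([0.9, 0.7, 0.6])
labeled_pred = np.array([0, 0, 1])
thresholds = class_wise_thresholds(labeled_conf, labeled_pred, num_classes=3)

cand_probs = np.array([
    [0.85, 0.10, 0.05],
    [0.30, 0.65, 0.05],
    [0.10, 0.10, 0.80],
    [0.75, 0.20, 0.05],
])
pred, pseudo_mask = selective_query(cand_probs, thresholds)
```

Because each class carries its own threshold, a class on which the model is systematically less confident is not forced through a single global cutoff, which is the motivation the abstract gives for class-wise thresholds.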

Paper Structure

This paper contains 20 sections, 19 equations, 15 figures, 9 tables, 1 algorithm.

Figures (15)

  • Figure 1: An overview of the proposed framework. (a) Class-guided features $\mathcal{F}_{\mathcal{C}}$ are obtained by averaging the image features $I$ with the weighted text features ${\tilde{T}}_{\mathcal{C}}$, using similarity scores as weights. In the heatmaps, $\mathcal{F}_{\mathcal{C}}$ focuses more on the guiding classes $\mathcal{C}=\{\text{Cat},\text{Dog}\}$ than $I$ does. (b) $K$-means clustering is performed on $\mathcal{F}_{\mathcal{C}}$. With an increasing $K$, cluster-balanced sampling becomes available in each round. (c) The confidence scores of previously labeled data (circles) serve as thresholds for new candidates (triangles). If a candidate's confidence exceeds its corresponding threshold, a pseudo-label is assigned to conserve the budget; otherwise, it is labeled by annotators.
  • Figure 2: GradFAM with various target features. (b) All objects in the image significantly impact the target image features $I$. (c-f) With our class-guided features $\mathcal{F}_{\mathcal{C}}$ as target features, the heatmap aligns with the target classes $\mathcal{C}$. Further details are in the Appendix.
  • Figure 3: Effect of the proposed acquisition. (a) Our CB+ SQ outperforms the other baselines in average performance across 9 datasets. (b-h) We achieve comparable performance to other baselines, while reducing the cumulative budget by 12% to 41% across all datasets.
  • Figure 4: t-SNE for class-guided clustering. (a, c) Clustering based solely on image features results in poorly separated clusters. (b, d) In contrast, our class-guided clustering, which incorporates class information, leads to more distinct clusters that align with the size of the guiding class set $\mathcal{C}$.
  • Figure 5: Ablation with proposed components. (a) Even with a single component removed, our method still outperforms PCB. (b) The performance of CoreSet improves with the incorporation of our class-guided features.
  • ...and 10 more figures
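
The class-guided features described in the caption of Figure 1(a) can be sketched as follows. The caption states only that image features are averaged with text features weighted by similarity scores; the L2 normalization, the softmax over cosine similarities, the `temperature` value, and the function names are assumptions made for illustration rather than the authors' exact formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def class_guided_features(image_feats, text_feats, temperature=0.01):
    """Average each image feature with a similarity-weighted text feature.

    image_feats: (N, D) array from an image encoder.
    text_feats:  (C, D) array, one text feature per guiding class.
    Assumption: the weights are a softmax over cosine similarities scaled
    by `temperature`.
    """
    I = l2_normalize(image_feats)                   # (N, D)
    T = l2_normalize(text_feats)                    # (C, D)
    logits = (I @ T.T) / temperature                # (N, C) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)   # each row sums to 1
    T_tilde = weights @ T                           # (N, D) weighted text features
    return 0.5 * (I + T_tilde)                      # class-guided features

# Toy usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
img = rng.normal(size=(5, 8))
txt = rng.normal(size=(3, 8))
F = class_guided_features(img, txt)
```

The resulting array `F` could then be passed to any $K$-means implementation, as in Figure 1(b), in place of the raw image features.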