AAPL: Adding Attributes to Prompt Learning for Vision-Language Models

Gahyeon Kim, Sohee Kim, Seokju Lee

TL;DR

AAPL addresses generalization gaps in prompt learning for vision–language models by disentangling augmentation-induced bias from class semantics. It introduces a delta meta token, computed as $\Delta\pi^{1A} = h_{\theta}(f(Aug_A(x_1))) - h_{\theta}(f(x_1))$, and an AdTriplet loss that emphasizes attribute information in prompts; the two are combined in the training objective $L_{total} = \alpha L_{AdTriplet} + \beta L_{CE}$. Across 11 datasets and multiple evaluation protocols, AAPL consistently improves base-to-new, cross-dataset, and domain-generalization performance relative to CoOp/CoCoOp, while augmentation profiling provides insight into which augmentations help or hinder generalization. The work demonstrates that attribute-focused prompt learning enhances robustness to domain shifts and unseen classes, with practical implications for deploying VLMs in diverse settings.
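The two formulas above can be made concrete with a minimal sketch. It assumes that $f$ is an image encoder, $h_{\theta}$ is the Meta-Net, and $Aug_A$ is an augmentation; the toy stand-ins (`toy_encoder`, `toy_meta_net`, `toy_augment`) and the weights passed to `total_loss` are hypothetical and chosen only so the arithmetic is easy to follow. They are not the authors' networks or hyperparameters.

```python
from typing import Callable, List

Vector = List[float]


def delta_meta_token(
    meta_net: Callable[[Vector], Vector],
    image_encoder: Callable[[Vector], Vector],
    augment: Callable[[Vector], Vector],
    x: Vector,
) -> Vector:
    """Return h(f(Aug(x))) - h(f(x)), computed element-wise."""
    augmented_token = meta_net(image_encoder(augment(x)))
    original_token = meta_net(image_encoder(x))
    return [a - o for a, o in zip(augmented_token, original_token)]


def total_loss(l_adtriplet: float, l_ce: float, alpha: float, beta: float) -> float:
    """Weighted sum L_total = alpha * L_AdTriplet + beta * L_CE."""
    return alpha * l_adtriplet + beta * l_ce


# Toy stand-ins (hypothetical, for illustration only).
def toy_encoder(v: Vector) -> Vector:
    return list(v)  # identity in place of a real image encoder


def toy_meta_net(v: Vector) -> Vector:
    return [2.0 * a for a in v]  # linear map in place of the Meta-Net


def toy_augment(v: Vector) -> Vector:
    return [a + 1.0 for a in v]  # constant shift in place of an image augmentation


x1 = [0.5, 1.0, -1.0]
delta_1a = delta_meta_token(toy_meta_net, toy_encoder, toy_augment, x1)
print(delta_1a)  # [2.0, 2.0, 2.0]
```

Because the toy Meta-Net is linear, the resulting delta depends only on the augmentation and not on the input image. That is an idealized version of the disentanglement the delta meta token is meant to capture; real encoders are nonlinear, so the subtraction alone would not guarantee it.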

Abstract

Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building upon this, recent studies, such as CoOp and CoCoOp, have proposed the use of prompt learning, where context within a prompt is replaced with learnable vectors, leading to significant improvements over manually crafted prompts. However, the performance improvement for unseen classes is still marginal, and to tackle this problem, data augmentation has been frequently used in traditional zero-shot learning techniques. Through our experiments, we have identified important issues in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, negatively impacting generalization to unseen classes. To address this problem, we propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts. Through our novel mechanism called "Adding Attributes to Prompt Learning", AAPL, we guide the learnable context to effectively extract text features by focusing on high-level features for unseen classes. We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performances compared to the existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.

Paper Structure (13 sections, 6 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Illustration of AAPL. When the learnable prompt is trained on the class "apple" and the training data consists mainly of red apples, it learns that apples are typically red. When a rare "yellow apple" is input, this instance bias may overlook the yellow attribute and incorrectly predict a pear. AAPL instead extracts and decomposes attributes from the image, enhancing attribute-specific bias in the semantic features, which improves generalization across domains.
  • Figure 2: Overview of AAPL. Two distinct random augmentations are applied to two input images with class labels 1 and 2. Image features are extracted by the pre-trained CLIP image encoder [radford2021learning] and passed through the Meta-Net [zhou2022conditional] to obtain meta tokens. For each class, subtracting the meta token of the original image from the meta token of an augmented image yields a delta meta token. The goal is for the delta meta tokens to be usable regardless of class: the AdTriplet loss pulls delta meta tokens associated with the same augmentation close together in the embedding space. The delta meta tokens thereby acquire attribute-specific features, while the meta token learns semantic features derived from the image features, so the decomposed features allow attribute-specific bias to be used in the learnable prompt.
  • Figure 3: Comparison between the meta tokens of CoCoOp and the meta tokens of CoCoOp with random augmentation on the FGVCAircraft dataset.
  • Figure 4: t-SNE visualization of the meta tokens and delta meta tokens of CoCoOp [zhou2022conditional] and AAPL on the FGVCAircraft dataset. Point colors represent the 14 different augmentations; 100 data points from the validation set are used. (a) and (c) visualize the meta tokens; (b) and (d) visualize the delta meta tokens.
  • Figure 5: Comparison of the number of constraints in the AdTriplet loss. The constraints-2 setting uses a single anchor, e.g., $\Delta\pi^{1B}$, whereas the constraints-4 setting uses two anchors, e.g., $\Delta\pi^{1A}$ and $\Delta\pi^{2B}$.
  • ...and 2 more figures
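
The anchor choices named in the Figure 2 and Figure 5 captions can be sketched in code. The captions state only that delta meta tokens sharing an augmentation are pulled together and which tokens act as anchors. The standard triplet-margin form, the Euclidean distance, the margin of 1.0, and the choice of negatives (same class, different augmentation) are assumptions made for this illustration, not details confirmed by the text above. The function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def l2_distance(u: Vector, v: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 1.0) -> float:
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, l2_distance(anchor, positive)
               - l2_distance(anchor, negative) + margin)


def adtriplet_constraints_2(delta: Dict[str, Vector], margin: float = 1.0) -> float:
    """Single anchor (delta 1B). Positive: same augmentation, other class.
    Negative (assumed): same class, other augmentation."""
    return triplet_loss(delta["1B"], delta["2B"], delta["1A"], margin)


def adtriplet_constraints_4(delta: Dict[str, Vector], margin: float = 1.0) -> float:
    """Two anchors (delta 1A and delta 2B), one triplet term per anchor."""
    return (triplet_loss(delta["1A"], delta["2A"], delta["1B"], margin)
            + triplet_loss(delta["2B"], delta["1B"], delta["2A"], margin))


# Keys are "<class><augmentation>". In `separated`, the delta tokens depend
# only on the augmentation; in `entangled`, they depend only on the class.
separated = {"1A": [1.0, 0.0], "2A": [1.0, 0.0], "1B": [0.0, 1.0], "2B": [0.0, 1.0]}
entangled = {"1A": [1.0, 0.0], "1B": [1.0, 0.0], "2A": [0.0, 1.0], "2B": [0.0, 1.0]}

print(adtriplet_constraints_4(separated))  # 0.0
print(adtriplet_constraints_4(entangled) > 0.0)  # True
```

Under these assumptions the loss is zero when delta tokens cluster by augmentation and positive when they cluster by class, which matches the behavior the captions describe.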