ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models

Xinyu Tian, Shu Zou, Zhaoyuan Yang, Jing Zhang

TL;DR

Attribute-Guided Prompt Tuning (ArGue) is introduced, which significantly outperforms current state-of-the-art prompt tuning methods on both novel class prediction and out-of-distribution generalization tasks.

Abstract

Although soft prompt tuning adapts Vision-Language (V&L) models to downstream tasks efficiently, it shows limitations in dealing with distribution shifts. We address this issue with Attribute-Guided Prompt Tuning (ArGue), making three key contributions. 1) In contrast to the conventional approach of directly prepending soft prompts to class names, we align the model with primitive visual attributes generated by Large Language Models (LLMs). We posit that a model's ability to express high confidence in these attributes signifies its capacity to discern the correct class rationales. 2) We introduce attribute sampling to eliminate disadvantageous attributes, so that only semantically meaningful attributes are preserved. 3) We propose negative prompting, which explicitly enumerates class-agnostic attributes to activate spurious correlations and encourages the model to generate highly orthogonal probability distributions in relation to these negative features. In experiments, our method significantly outperforms current state-of-the-art prompt tuning methods on both novel class prediction and out-of-distribution generalization tasks.
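
To make the first contribution more concrete, the following is a minimal sketch of attribute-based class scoring. It is not the authors' implementation: the function names, the use of a plain average over attribute similarities, and the `temperature` value are all assumptions chosen for illustration, and the feature vectors are assumed to come from a CLIP-like image and text encoder that is not shown here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def attribute_guided_probs(image_feat, attr_feats_per_class, temperature=0.01):
    """Score each class by its attributes instead of a single class-name prompt.

    image_feat: 1-D image embedding (assumed to come from a CLIP-like encoder).
    attr_feats_per_class: one (num_attributes, dim) array per class, holding
        the text embeddings of that class's attribute prompts.
    Assumption: the class logit is the mean cosine similarity over attributes.
    """
    img = l2_normalize(np.asarray(image_feat, dtype=float))
    logits = []
    for attr_feats in attr_feats_per_class:
        attrs = l2_normalize(np.asarray(attr_feats, dtype=float))
        logits.append(float(np.mean(attrs @ img)))
    logits = np.array(logits) / temperature
    logits -= logits.max()  # subtract the max for a numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum()
```

Under this reading, a class receives high probability only when the image agrees with several of its attribute prompts, which is one way to interpret "high confidence in these attributes".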

Paper Structure (15 sections, 10 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Illustration of negative prompting. Given an image of a cat (Left), we visualize the model rationale with Grad-CAM [selvaraju2017grad], which highlights the image pixels that most strongly determine the result (Middle). A standard prompt could be ${\rm a \ photo \ of \ a \ cat}$, for which vanilla models, e.g., CLIP [radford2021learning], assign high confidence to the ground-truth class (the "CLIP" column). However, a negative prompt, e.g., ${\rm the \ background \ of \ a \ cat}$, yields a biased prediction since it activates the spurious correlation, i.e., $\rm background$. In contrast, our attribute-guided model (the "Ours" column) disregards incorrect rationales and bases its predictions solely on class-specific semantics.
  • Figure 2: The pipeline of ArGue. In (a), we instruct the LLMs to generate attribute candidates using various LLM templates. In (b), we extract semantically relevant attributes through an assessment of their similarity to images, as described in Sec. \ref{sub:attr_samp}. In (c), with guidance from the selected attributes and the application of negative prompting, we construct a set of soft tokens tailored to the task, which is detailed in Sec. \ref{sec:PR} and Sec. \ref{sec:NP}.
  • Figure 3: Two example classes from (a) ImageNet and (b) ImageNet-Sketch for the attribute sampling procedure. We show several attributes for each class; the number within the yellow bar indicates the attribute's similarity to images in CLIP space. For each class, we designate 3 clusters, resulting in the selection of the 3 attributes with the highest similarity scores, which are framed with black boxes.
  • Figure 4: The Grad-CAM visualization of our method and baselines. (A) contains visual images. (B) features a comparison between our method and baselines using standard prompts, where CoOp and ArGue-N replace the template ${\rm A \ photo \ of \ a}$ with their respective soft tokens. (C) reveals the rationale of ArGue-N concerning various visual attributes. (D) showcases the negative prompt used during training.
  • Figure 5: The absolute improvement of manual labeling compared with LLMs on novel class prediction. We randomly select 10 classes from each benchmark dataset for simplicity, i.e., 5 base classes and 5 novel classes. The accuracy is calculated solely based on the selected classes. It is worth noting that we have omitted the incorporation of negative prompting in the comparison, and attribute sampling has not been applied in the context of manual labeling.
  • ...and 3 more figures
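
The attribute sampling procedure described in the Figure 3 caption can be sketched as follows. This is an illustrative reading, not the paper's code: it assumes that attributes are grouped by k-means on their text embeddings and that the single attribute with the highest image similarity is kept from each cluster. The helper names `kmeans_labels` and `sample_attributes`, the deterministic farthest-point initialisation, and the input arrays are all hypothetical.

```python
import numpy as np

def kmeans_labels(x, k, iters=20):
    """Small deterministic k-means (assumed clustering method, for illustration)."""
    x = np.asarray(x, dtype=float)
    # Farthest-point initialisation: start from the first point, then keep
    # adding the point that is farthest from all centers chosen so far.
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([np.sum((x - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(x[int(np.argmax(d))])
    centers = np.stack(centers)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    # Recompute assignments against the final centers.
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def sample_attributes(attr_embeds, image_sims, k=3):
    """Return sorted indices of the kept attributes, one per non-empty cluster.

    attr_embeds: (num_attributes, dim) attribute text embeddings.
    image_sims: per-attribute similarity to the class images (e.g. in CLIP space).
    """
    labels = kmeans_labels(attr_embeds, k)
    sims = np.asarray(image_sims, dtype=float)
    chosen = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size:
            # Keep the cluster member that is most similar to the images.
            chosen.append(int(members[np.argmax(sims[members])]))
    return sorted(chosen)
```

Clustering before selection is one way to keep the retained attributes from being near-duplicates of each other; whether the paper uses k-means specifically is not stated in the text above.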