Can Better Text Semantics in Prompt Tuning Improve VLM Generalization?

Hari Chandana Kuchibhotla, Sai Srinivas Kancheti, Abbavaram Gowtham Reddy, Vineeth N Balasubramanian

TL;DR

This work introduces a prompt-tuning method that uses class descriptions obtained from Large Language Models (LLMs) to bridge the image and text modalities, and shows that it outperforms established methods.

Abstract

Going beyond mere fine-tuning of vision-language models (VLMs), learnable prompt tuning has emerged as a promising, resource-efficient alternative. Despite its potential, learning prompts effectively faces two challenges: (i) training in a low-shot scenario results in overfitting, which limits adaptability and yields weaker performance on new classes or datasets; (ii) the efficacy of prompt tuning relies heavily on the label space, with performance decreasing in large class spaces, which signals potential gaps in bridging image and class concepts. In this work, we investigate whether better text semantics can help address these concerns. In particular, we introduce a prompt-tuning method that leverages class descriptions obtained from Large Language Models (LLMs). These class descriptions are used to bridge the image and text modalities. Our approach constructs part-level description-guided image and text features, which are subsequently aligned to learn more generalizable prompts. Our comprehensive experiments across 11 benchmark datasets show that our method outperforms established methods, demonstrating substantial improvements.

Paper Structure

This paper contains 19 sections, 7 equations, 7 figures, 19 tables, 1 algorithm.

Figures (7)

  • Figure 1: Top: Comparison of GradCAM visualizations for our proposed method SAP against other baselines, on the classes "Applying Lipstick" and "Clean and Jerk" from the action recognition dataset UCF101. The saliency maps indicate the image regions that are most relevant to the descriptions "A photo of applying lipstick has a person applying lipstick to lips" and "A photo of clean and jerk which has a person lifting a barbell", respectively. SAP localizes the text semantics in images more effectively than the baselines. Bottom: SAP surpasses other baselines on the Generalized Zero-Shot (GZS) and Base-to-Novel (B2N) benchmarks, showing improvements of +1.6% and +1.2 on Novel Accuracy and Harmonic Mean (HM) for GZS, and +1.4% and +0.9 for B2N, compared to the best-performing baselines.
  • Figure 2: Our proposed workflow, SAP, performs part-based semantic alignment between image and text features. SAP integrates class descriptions into the text template, which is passed through the text encoder to construct description-guided text features. Global and local image features are obtained from the image encoder. Description-guided image features are obtained by performing parameter-free cross-attention between class descriptions and local features. These image features are pooled into a mean description-guided image feature, which is then fused with the global image feature to obtain the fused image feature. The description-guided text features and the fused image feature contain part-level semantic information and are semantically aligned. We optimize a cross-entropy loss $L_{ce}$ and two steering losses, $L_{steer}^{v}$ and $L_{steer}^{t}$.
  • Figure 3: Addition of a bias vector to the last transformer block in $\theta$
  • Figure 4: Comparison in the OVC setting. We show average Base, Novel, and HM accuracies over all 11 datasets. During evaluation, descriptions of each class are provided instead of the class name, and visual recognition is conducted based on these descriptions. SAP outperforms the baselines on average Base (by $+1.75\%$), Novel (by $+1.76\%$), and HM (by $+2.04\%$) accuracy, computed over all datasets. Detailed dataset-wise results are presented in the appendix (additional results).
  • Figure 5: Images are highlighted at the regions of highest activation relevant to specific text phrases, as identified by each method's prompted image and text encoders. Qualitatively, SAP localizes better than the existing baselines.
  • ...and 2 more figures
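
The image branch described in the Figure 2 caption (parameter-free cross-attention, mean pooling, fusion with the global feature) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The function names, the 1/sqrt(d) score scaling, and the convex-combination fusion with weight `alpha` are assumptions made for illustration, since the caption does not specify them, and the toy vectors stand in for real encoder outputs.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def description_guided_features(desc_feats, local_feats):
    """Cross-attention with no learned projections.

    Each description feature acts as a query; the local (patch) image
    features act as both keys and values.
    """
    dim = len(local_feats[0])
    scale = 1.0 / math.sqrt(dim)  # assumed scaling, not stated in the caption
    guided = []
    for query in desc_feats:
        weights = softmax([scale * dot(query, key) for key in local_feats])
        guided.append(
            [sum(w * key[j] for w, key in zip(weights, local_feats))
             for j in range(dim)]
        )
    return guided


def mean_pool(feats):
    """Average a list of vectors into a single vector."""
    return [sum(column) / len(feats) for column in zip(*feats)]


def fuse(global_feat, pooled_feat, alpha=0.5):
    """Hypothetical fusion rule: a convex combination of the two features."""
    return [alpha * g + (1.0 - alpha) * p
            for g, p in zip(global_feat, pooled_feat)]


# Toy inputs standing in for encoder outputs (2 descriptions, 3 patches, dim 2).
desc_feats = [[1.0, 0.0], [0.0, 1.0]]
local_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
global_feat = [0.5, 0.5]

guided = description_guided_features(desc_feats, local_feats)
pooled = mean_pool(guided)
fused = fuse(global_feat, pooled)
```

Because the attention has no learned weights, each description-guided feature is simply a convex combination of the patch features, weighted by how strongly each patch matches the description. In the method described in the caption, the fused feature is then aligned with the description-guided text features.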