Learning Generalizable Prompt for CLIP with Class Similarity Knowledge

Sehun Jung, Hyang-won Lee

TL;DR

The paper addresses the generalization gap of learnable prompts in CLIP when encountering unseen classes, caused by semantic disruption of text embeddings. It introduces Similarity Alignment Regularization (SAR), a plug-in regularizer that aligns the relational structure of learnable prompts’ text embeddings with those from hand-crafted prompts across base and novel classes, using novel classes generated by ChatGPT-4o and random embedding sampling to reduce overfitting. SAR improves base-to-new generalization across 11 datasets and five baselines, while maintaining or enhancing base accuracy, and proves effective with various word sources. However, SAR incurs extra memory and training time due to computing novel-class embeddings, motivating future work to reduce overhead and explore supervision that minimizes reliance on hand-crafted prompts.

Abstract

In vision-language models (VLMs), prompt tuning has shown its effectiveness in adapting models to downstream tasks. However, learned prompts struggle to generalize to unseen classes, as they tend to overfit to the classes that are targeted during prompt tuning. Examining failure cases, we observed that learned prompts disrupt the semantics of unseen classes, generating text embeddings with incorrect semantic relationships among classes. To address this, we propose Similarity Alignment Regularization (SAR), which regularizes learnable prompts to preserve the semantic relationships among classes captured by hand-crafted prompts. Specifically, we first obtain novel classes related to base classes using ChatGPT-4o and utilize them as potential unseen classes during prompt tuning. Then, by targeting both base and novel classes, SAR aligns the similarity relationships among text embeddings generated by learnable prompts with the similarity relationships from hand-crafted prompts. Extensive experiments applying SAR to existing prompt tuning methods demonstrate its effectiveness in improving generalization to unseen classes.

Paper Structure

This paper contains 16 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Prompt generalization evaluation. (Top) Heatmap visualization of similarity distribution matrices computed over all (base+new) classes. From left to right: 1) $\mathbf{P}_{\mathtt{CoOp}}$, 2) $\mathbf{P}_{\mathtt{hand}}$, 3) the matrix produced by prompts learned by CoOp with SAR applied, and 4) the matrix produced by prompts learned by TCP (Yao et al., 2024). In class names, L. and B. are abbreviations of $Land$ and $Building$, respectively. An asterisk (*) before a class name indicates a new class, which was not used during prompt training. (Bottom) t-SNE scatterplots of logits for test images from new classes. In CoOp, the logit points corresponding to images of $River$ and $Sea\ or\ Lake$ are broadly distributed, forming an ambiguous cluster boundary. In contrast, no such issue appears in the logit visualization of CoOp with SAR applied, thanks to the guidance of SAR.
  • Figure 2: Overview of how Similarity Alignment Regularization (SAR) operates in prompt tuning. SAR targets the base and novel classes, aligning the semantic relationships among text embeddings generated by learnable prompts with those from ensembled hand-crafted prompts. Specifically, this alignment is achieved by minimizing the KL divergence between the corresponding similarity distributions. To mitigate overfitting, random embedding sampling is employed instead of computing similarities across all classes.
  • Figure 3: Effect of number of novel classes on performance in 16-shot setting. The results are averaged across 11 datasets.
  • Figure 4: Left: Effect of regularization weight $\lambda$ on performance in 4-shot setting. Right: Trend of SAR loss as $\lambda$ increases. The results are averaged across 11 datasets.
  • Figure 5: Performance gains of SAR over the baseline across different word sources. 'Oracle' refers to the use of new classes in the dataset as novel classes for SAR. The results are averaged across 11 datasets.
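
The mechanism described in the Figure 2 caption (KL divergence between similarity distributions, computed over a random sample of class embeddings) and the weight $\lambda$ from Figure 4 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the temperature value, the exclusion of self-similarity, the KL direction (hand-crafted as target), and the use of random vectors in place of CLIP text embeddings are all assumptions made for illustration.

```python
import numpy as np


def log_similarity_distribution(emb, temperature=0.07):
    """Row-wise log-softmax over pairwise cosine similarities.

    Assumption: each class's similarity to itself is dropped, so every row
    is a distribution over the *other* sampled classes.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    n = emb.shape[0]
    logits = (emb @ emb.T) / temperature
    # Remove the diagonal: (n, n) -> (n, n - 1).
    logits = logits[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))


def sar_loss(learned_emb, hand_emb, num_samples, rng, temperature=0.07):
    """Hypothetical SAR-style loss: mean KL(P_hand || P_learned).

    learned_emb, hand_emb: (num_classes, dim) text embeddings for the same
    classes (base + novel), in the same row order.
    num_samples: size of the random class subset; must be at least 2.
    """
    n = learned_emb.shape[0]
    # Random embedding sampling: use a subset of classes, not all of them.
    idx = rng.choice(n, size=min(num_samples, n), replace=False)
    log_q = log_similarity_distribution(learned_emb[idx], temperature)
    log_p = log_similarity_distribution(hand_emb[idx], temperature)
    kl_per_row = (np.exp(log_p) * (log_p - log_q)).sum(axis=1)
    return float(kl_per_row.mean())


# Toy stand-ins for CLIP text embeddings (assumption: 12 classes, 8 dims).
data_rng = np.random.default_rng(0)
hand = data_rng.normal(size=(12, 8))
learned = hand + 0.1 * data_rng.normal(size=(12, 8))

loss_same = sar_loss(hand, hand, 6, np.random.default_rng(1))
loss_diff = sar_loss(learned, hand, 6, np.random.default_rng(1))

# The regularizer is added to the baseline's own objective with weight lambda.
task_loss = 1.0  # placeholder for the baseline prompt-tuning loss
lam = 0.5        # placeholder value, not taken from the paper
total_loss = task_loss + lam * loss_diff
```

When the learned embeddings reproduce the hand-crafted ones exactly, the two similarity distributions coincide and the loss is zero. Any distortion of the relational structure makes it positive, which is the signal the regularizer penalizes. In an actual training loop the same computation would be written in a differentiable framework so that gradients reach the learnable prompt tokens.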