Learning Generalizable Prompt for CLIP with Class Similarity Knowledge
Sehun Jung, Hyang-won Lee
TL;DR
The paper addresses the generalization gap of learnable prompts in CLIP on unseen classes, which it attributes to learned prompts disrupting the semantics of text embeddings. It introduces Similarity Alignment Regularization (SAR), a plug-in regularizer that aligns the similarity structure among text embeddings produced by learnable prompts with that of hand-crafted prompts, across both base and novel classes. It uses novel classes generated by ChatGPT-4o together with random embedding sampling to reduce overfitting. SAR improves base-to-new generalization across 11 datasets and five baselines while maintaining or improving base accuracy, and it remains effective when the novel class names come from various word sources. However, SAR incurs extra memory and training time because it computes novel-class embeddings, motivating future work on reducing this overhead and on supervision that relies less on hand-crafted prompts.
Abstract
In vision-language models (VLMs), prompt tuning has shown its effectiveness in adapting models to downstream tasks. However, learned prompts struggle to generalize to unseen classes, as they tend to overfit to the classes that are targeted during prompt tuning. Examining failure cases, we observed that learned prompts disrupt the semantics of unseen classes, generating text embeddings with incorrect semantic relationships among classes. To address this, we propose Similarity Alignment Regularization (SAR), which regularizes learnable prompts to preserve the semantic relationships among classes captured by hand-crafted prompts. Specifically, we first obtain novel classes related to base classes using ChatGPT-4o and utilize them as potential unseen classes during prompt tuning. Then, by targeting both base and novel classes, SAR aligns the similarity relationships among text embeddings generated by learnable prompts with the similarity relationships from hand-crafted prompts. Extensive experiments applying SAR to existing prompt tuning methods demonstrate its effectiveness in improving generalization to unseen classes.
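The alignment idea described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not state the exact loss, so the mean squared difference between cosine-similarity matrices used here is an assumption, as are the function names and the reading of "random embedding sampling" as drawing a random subset of novel classes at each step.

```python
import math
import random


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarities between the rows of a list of vectors."""
    normalized = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normalized.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, v)) for v in normalized]
            for u in normalized]


def similarity_alignment_loss(learned, handcrafted):
    """Mean squared gap between the two similarity matrices.

    Assumed loss form: the source only says the similarity relationships
    are aligned, not which distance is used.
    """
    sim_learned = cosine_similarity_matrix(learned)
    sim_hand = cosine_similarity_matrix(handcrafted)
    n = len(learned)
    total = sum((sim_learned[i][j] - sim_hand[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)


def sample_class_indices(num_base, num_novel, num_sampled, rng):
    """Keep every base class and draw a random subset of novel classes.

    Hypothetical helper: one possible reading of random embedding sampling.
    """
    novel = rng.sample(range(num_base, num_base + num_novel), num_sampled)
    return list(range(num_base)) + sorted(novel)


# Toy usage: rows are text embeddings for 2 base classes and 3 novel classes.
handcrafted = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 2.0], [2.0, 1.0]]
learned = [[1.0, 0.1], [0.9, 0.2], [1.0, 1.0], [1.0, 2.0], [2.0, 1.0]]

rng = random.Random(0)
idx = sample_class_indices(num_base=2, num_novel=3, num_sampled=2, rng=rng)
loss = similarity_alignment_loss([learned[i] for i in idx],
                                 [handcrafted[i] for i in idx])
print(idx, loss)
```

In a real prompt-tuning setup this term would be added to the task loss with a weighting coefficient, and the embeddings would come from CLIP's text encoder rather than hand-written vectors. Comparing similarity matrices instead of the embeddings themselves leaves the learned prompts free to move, as long as the relationships among classes stay close to those from the hand-crafted prompts.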
