InPK: Infusing Prior Knowledge into Prompt for Vision-Language Models

Shuchang Zhou, Jiwei Wei, Shiyuan He, Yuyang Zhou, Chaoning Zhang, Jie Zou, Ning Xie, Yang Yang

TL;DR

Prompt-tuning of Vision-Language Models often overfits to base classes when learnable tokens are randomly initialized, limiting unseen-class generalization. InPK tackles this by infusing class-specific prior knowledge into the learnable prompt tokens at initialization and progressively reinforcing token–knowledge interactions across multiple feature levels, complemented by a learnable text-to-vision projection for better multimodal alignment. Prior knowledge is generated offline with GPT-4 to provide discriminative attributes for each class, and a regularization term preserves general CLIP semantics while emphasizing class-name information. Across 11 datasets, InPK achieves state-of-the-art results in base-to-novel generalization, few-shot learning, cross-dataset evaluation, and domain generalization, with ablations confirming the effectiveness of the infusion strategy and multi-level interaction.

Abstract

Prompt tuning has become a popular strategy for adapting Vision-Language Models (VLMs) to zero/few-shot visual recognition tasks. Some prompting techniques introduce prior knowledge due to its richness, but when learnable tokens are randomly initialized and disconnected from prior knowledge, they tend to overfit on seen classes and struggle with domain shifts for unseen ones. To address this issue, we propose the InPK model, which infuses class-specific prior knowledge into the learnable tokens during initialization, thus enabling the model to explicitly focus on class-relevant information. Furthermore, to mitigate the weakening of class information by multi-layer encoders, we continuously reinforce the interaction between learnable tokens and prior knowledge across multiple feature levels. This progressive interaction allows the learnable tokens to better capture the fine-grained differences and universal visual concepts within prior knowledge, enabling the model to extract more discriminative and generalized text features. Even for unseen classes, the learned interaction allows the model to capture their common representations and infer their appropriate positions within the existing semantic structure. Moreover, we introduce a learnable text-to-vision projection layer to accommodate the text adjustments, ensuring better alignment of visual-text semantics. Extensive experiments on 11 recognition datasets show that InPK significantly outperforms state-of-the-art methods in multiple zero/few-shot image classification tasks.
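The infusion step described in the abstract can be pictured as a residual cross-attention: each learnable token queries the class's prior-knowledge embeddings and adds the attended result to itself, and this is repeated before every encoder layer. The snippet below is only a minimal sketch under stated assumptions, not the authors' implementation. The function names (`infuse`, `encode_with_infusion`), the use of plain scaled dot-product attention without learned query/key/value projections, and the `layers` callables standing in for encoder layers are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def infuse(tokens, attributes):
    """Add an attention-weighted mix of attribute embeddings to each token.

    Assumption: plain scaled dot-product attention with no learned
    query/key/value projections; the paper's exact form may differ.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    infused = []
    for token in tokens:
        # Similarity of this token to every attribute embedding.
        scores = [scale * sum(t * a for t, a in zip(token, attr)) for attr in attributes]
        weights = softmax(scores)
        # Attention-weighted average of the attribute embeddings.
        context = [sum(w * attr[d] for w, attr in zip(weights, attributes)) for d in range(dim)]
        # Residual connection keeps the original token content.
        infused.append([t + c for t, c in zip(token, context)])
    return infused


def encode_with_infusion(tokens, attributes_per_level, layers):
    """Re-infuse prior knowledge before every (hypothetical) encoder layer."""
    for attributes, layer in zip(attributes_per_level, layers):
        tokens = layer(infuse(tokens, attributes))
    return tokens
```

For example, `encode_with_infusion([[0.0, 0.0]], [[[1.0, 1.0]]] * 2, [lambda x: x] * 2)` runs two infusion levels with identity functions in place of real encoder layers. In an actual model the tokens and attribute embeddings would be high-dimensional tensors and the layers would be the text encoder's transformer blocks.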

Paper Structure

This paper contains 20 sections, 14 equations, 4 figures, 22 tables.

Figures (4)

  • Figure 1: (a) Existing prior knowledge-based prompting techniques initialize learnable tokens randomly, resulting in task-agnostic token representations that prolong the optimization trajectory and increase susceptibility to local optima (top). Our approach infuses prior knowledge into the learnable tokens each time before they are fed into the encoder layers, providing a prior knowledge-based initialization that explicitly emphasizes class-relevant information (bottom). (b) t-SNE visualization of the feature manifolds for our method with and without prior knowledge infusion (pkin) on the EuroSAT dataset. Our method with pkin shows tighter intra-class distances, indicating enhanced consistency, and larger inter-class distances, reflecting improved class discrimination. (c) In the base-to-base/base-to-novel setting, our method significantly outperforms state-of-the-art methods in average results on 11 recognition datasets.
  • Figure 2: Overview of the InPK method. Prior knowledge is generated offline using a predefined instruction template and subsequently fed into the Prior Knowledge-infused text encoder (PKi). Within PKi, class-specific prior knowledge is infused into learnable tokens through attribute-aware attention at the initialization stage, and the interaction between tokens and prior knowledge is progressively reinforced across multiple feature levels. Meanwhile, we introduce a learnable text-to-vision projection layer to better align visual-text semantics. Furthermore, the loss $L_{text}$ is applied to mitigate the model's forgetting of general information and to emphasize the role of class names.
  • Figure 3: Comparison of InPK with existing methods in the few-shot learning setting. All models are trained with 1, 2, 4, 8, and 16 shots per class and deployed on the full test set. Our method shows competitive performance in few-shot learning, achieving the highest average results.
  • Figure 4: Ablation on prompt depth (left) and the number of attribute words (right).
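
The learnable text-to-vision projection mentioned in the Figure 2 caption can be read as a linear map applied to each class's text feature before it is compared with the image feature. The sketch below is an illustration under assumptions rather than the paper's code: it assumes CLIP-style cosine-similarity classification, a single weight matrix `weight` for the projection, and a softmax temperature of 0.01. The names `project`, `normalize`, and `classify` are hypothetical.

```python
import math


def project(text_feature, weight):
    """Apply a linear text-to-vision projection (one row per output dimension)."""
    return [sum(w * x for w, x in zip(row, text_feature)) for row in weight]


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def classify(image_feature, text_features, weight, temperature=0.01):
    """Return class probabilities from cosine similarity of image and projected text.

    Assumption: CLIP-style scoring; `temperature` is an illustrative value.
    """
    image = normalize(image_feature)
    logits = []
    for text_feature in text_features:
        text = normalize(project(text_feature, weight))
        cosine = sum(i * t for i, t in zip(image, text))
        logits.append(cosine / temperature)
    # Numerically stable softmax over the class logits.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With an identity `weight`, the projection leaves the text features unchanged and the function reduces to ordinary zero-shot CLIP scoring; during prompt tuning the entries of `weight` would be trained together with the learnable tokens.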