Visual-Language Prompt Tuning with Knowledge-guided Context Optimization

Hantao Yao, Rui Zhang, Changsheng Xu

TL;DR

This paper tackles the problem that task-specific prompt tuning can forget general, transferable textual knowledge when adapting a pretrained visual-language model. It introduces Knowledge-guided Context Optimization (KgCoOp), which adds a regularization term to align learnable prompts with hand-crafted prompts, thereby improving generalization to unseen classes while maintaining training efficiency. Through extensive experiments on base-to-new generalization, domain generalization, and few-shot tasks, KgCoOp consistently yields higher unseen-class performance and harmonic means than prior CoOp-based methods, with comparable or faster training times. The work provides both ablation evidence and practical guidance on how to balance general and task-specific knowledge in prompt tuning for robust visual-language understanding.

Abstract

Prompt tuning is an effective way to adapt a pre-trained visual-language model (VLM) to a downstream task using task-related textual tokens. Representative CoOp-based work combines learnable textual tokens with the class tokens to obtain task-specific textual knowledge. However, this specific textual knowledge generalizes worse to unseen classes because it forgets the essential general textual knowledge, which has strong generalization ability. To tackle this issue, we introduce a novel Knowledge-guided Context Optimization (KgCoOp) to enhance the generalization ability of the learnable prompt for unseen classes. The key insight of KgCoOp is that the forgetting of essential knowledge can be alleviated by reducing the discrepancy between the learnable prompt and the hand-crafted prompt. Specifically, KgCoOp minimizes the discrepancy between the textual embeddings generated by the learned prompts and those generated by the hand-crafted prompts. Finally, adding the KgCoOp constraint to the contrastive loss yields a prompt that is discriminative for both seen and unseen tasks. Extensive evaluation on several benchmarks demonstrates that the proposed Knowledge-guided Context Optimization is an efficient method for prompt tuning, i.e., it achieves better performance with less training time.
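The objective described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes the discrepancy is the mean squared Euclidean distance between each class's learned and hand-crafted text embeddings, and that the classification term is a cross-entropy over temperature-scaled cosine similarities. The function names, the temperature value, and the weight `lam` are hypothetical choices made for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(image_feat, class_embs, temperature=0.01):
    """Temperature-scaled cosine similarity between one image and each class embedding.

    The temperature value is an illustrative assumption.
    """
    img = l2_normalize(image_feat)
    return [
        sum(a * b for a, b in zip(img, l2_normalize(w))) / temperature
        for w in class_embs
    ]


def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def knowledge_guided_penalty(learned_embs, handcrafted_embs):
    """Mean squared Euclidean distance between learned and hand-crafted
    class embeddings (the assumed form of the discrepancy)."""
    total = 0.0
    for w, w_clip in zip(learned_embs, handcrafted_embs):
        total += sum((a - b) ** 2 for a, b in zip(w, w_clip))
    return total / len(learned_embs)


def kgcoop_style_loss(image_feat, label, learned_embs, handcrafted_embs, lam=1.0):
    """Classification loss plus the weighted knowledge-guided penalty."""
    ce = cross_entropy(cosine_logits(image_feat, learned_embs), label)
    kg = knowledge_guided_penalty(learned_embs, handcrafted_embs)
    return ce + lam * kg
```

With `lam=0` the sketch reduces to the plain classification loss. Increasing `lam` pulls the learned class embeddings toward the hand-crafted ones, which is the trade-off between task-specific and general knowledge that the abstract describes.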

Paper Structure (21 sections, 5 equations, 7 figures, 29 tables)

This paper contains 21 sections, 5 equations, 7 figures, 29 tables.

Figures (7)

  • Figure 1: For CoOp-based prompt tuning, the degree of performance degradation $\triangledown_{new}$ on the New classes is consistent with the distance between the learnable textual embedding $\mathbf{w}_{coop}$ and the hand-crafted textual embedding $\mathbf{w}_{clip}$. The larger the distance, the more severe the performance degradation. $\sigma_{clip}$ and $\sigma_{coop}$ are the accuracies on the New classes for CLIP and CoOp, respectively.
  • Figure 2: The framework of Knowledge-guided Context Optimization for prompt tuning. $\mathcal{L}_{ce}$ is the standard cross-entropy loss, and $\mathcal{L}_{kg}$ is the proposed Knowledge-guided Context Optimization constraint, which minimizes the discrepancy between the specific knowledge (the learnable textual embeddings) and the general knowledge (the textual embeddings generated by the hand-crafted prompt).
  • Figure 3: Effect of $\lambda$ for the 4-shot and 16-shot settings on base-to-new generalization. H: harmonic mean.
  • Figure A1: Effect of context length.
  • Figure A2: Effect of initialization.
  • ...and 2 more figures