SEP: Self-Enhanced Prompt Tuning for Visual-Language Model

Hantao Yao, Rui Zhang, Lu Yu, Yongdong Zhang, Changsheng Xu

TL;DR

Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning, and its self-enhanced tokens not only boost discrimination but also mitigate domain shifts in unseen domains, enhancing generalization.

Abstract

Prompt tuning based on Context Optimization (CoOp) effectively adapts visual-language models (VLMs) to downstream tasks by inferring additional learnable prompt tokens. However, these tokens are less discriminative because they are independent of the pre-trained tokens and fail to capture input-specific knowledge, such as class-aware textual or instance-aware visual knowledge. Leveraging the discriminative and generalization capabilities inherent in pre-trained tokens, we introduce a novel approach named Self-Enhanced Prompt Tuning (SEP). The core principle of SEP is to adapt the learnable prompt tokens at each encoder layer from the corresponding self-pretrained tokens, thereby explicitly incorporating discriminative prior knowledge to enhance both textual-level and visual-level embeddings. Furthermore, SEP's self-enhanced tokens not only boost discrimination but also mitigate domain shifts in unseen domains, enhancing generalization. In practice, SEP selects several representative tokens from all pre-trained tokens for each input at every layer of the text/visual encoders. Subsequently, a Token Fusion Module (TFM) generates a self-enhanced token by merging these representative tokens with the learnable tokens using a cross-attention mechanism. This self-enhanced token is then concatenated with all pre-trained tokens, serving as input for subsequent encoder layers to produce the relevant embeddings. Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning. Code: https://github.com/htyao89/SEP
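The per-layer pipeline described in the abstract (select representative tokens, fuse them with the learnable tokens by cross-attention, concatenate the result with the pre-trained tokens) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The function names, the norm-based selection rule, the unprojected single-head attention, the residual connection, and the concatenation order are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_representative(tokens, k):
    """Pick k tokens. ASSUMPTION: rank by squared L2 norm (hypothetical rule)."""
    return sorted(tokens, key=lambda t: dot(t, t), reverse=True)[:k]


def token_fusion(learnable, representative):
    """Cross-attention sketch: learnable tokens act as queries,
    representative pre-trained tokens act as keys and values.
    ASSUMPTION: no learned projections, one head, residual connection."""
    dim = len(learnable[0])
    fused = []
    for query in learnable:
        weights = softmax(
            [dot(query, key) / math.sqrt(dim) for key in representative]
        )
        attended = [
            sum(w * value[d] for w, value in zip(weights, representative))
            for d in range(dim)
        ]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


def sep_layer_input(pretrained, learnable, k):
    """Build the token sequence passed to the next encoder layer.
    ASSUMPTION: self-enhanced tokens are placed before the pre-trained ones."""
    representative = select_representative(pretrained, k)
    enhanced = token_fusion(learnable, representative)
    return enhanced + pretrained  # concatenate along the token axis


random.seed(0)
dim, n_pretrained, n_learnable, k = 8, 6, 2, 3
pretrained = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_pretrained)]
learnable = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_learnable)]

layer_input = sep_layer_input(pretrained, learnable, k)
print(len(layer_input), len(layer_input[0]))  # prints: 8 8
```

In this sketch the pre-trained tokens pass through unchanged and only the learnable tokens are modified, which mirrors the idea that the frozen tokens supply the prior knowledge. A practical version would use a tensor library and learned query, key, and value projections.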

Paper Structure (9 sections, 9 equations, 3 figures, 9 tables)

Figures (3)

  • Figure 1: Comparison with existing frameworks. (a) Conventional input-irrelevant prompt tuning; (b) Image-conditional prompt tuning; (c) Self-enhanced prompt tuning, which injects the discriminative and generalizable knowledge contained in the frozen tokens.
  • Figure 2: The framework of the proposed Self-Enhanced Prompt Tuning (SEP). The Token Fusion Module (TFM) integrates the pre-trained tokens and the prompt-related tokens to generate the self-enhanced prompt.
  • Figure 3: Effect of $W_v$.