Table of Contents
Fetching ...

PRE: Vision-Language Prompt Learning with Reparameterization Encoder

Thi Minh Anh Pham, An Duc Nguyen, Cephas Svosve, Vasileios Argyriou, Georgios Tzimiropoulos

TL;DR

This work presents Prompt Learning with Reparameterization Encoder (PRE) - a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes.

Abstract

Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks. However, to attain optimal performance, the manual selection of prompts is necessary to improve alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is the major challenge for deploying such models in practice since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, recent work Context Optimization (CoOp) introduced the concept of prompt learning to the vision domain using learnable textual tokens. While CoOp can achieve substantial improvements over manual prompts, its learned context is worse generalizable to wider unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE) - a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning. Specifically, PRE achieves a notable enhancement of 5.60% in average accuracy on New classes and 3% in Harmonic mean compared to CoOp in the 16-shot setting, all achieved within a good training time.

PRE: Vision-Language Prompt Learning with Reparameterization Encoder

TL;DR

This work presents Prompt Learning with Reparameterization Encoder (PRE) - a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes.

Abstract

Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks. However, to attain optimal performance, the manual selection of prompts is necessary to improve alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is the major challenge for deploying such models in practice since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, recent work Context Optimization (CoOp) introduced the concept of prompt learning to the vision domain using learnable textual tokens. While CoOp can achieve substantial improvements over manual prompts, its learned context is worse generalizable to wider unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE) - a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning. Specifically, PRE achieves a notable enhancement of 5.60% in average accuracy on New classes and 3% in Harmonic mean compared to CoOp in the 16-shot setting, all achieved within a good training time.
Paper Structure (22 sections, 7 equations, 8 figures, 9 tables)

This paper contains 22 sections, 7 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: PRE is depicted in the following illustration: The original prompt embeddings $V$ undergo projection using a trainable prompt encoder $\mathcal{F}(\cdot)$ equipped with a residual connection. This process enables the modeling of domain-specific sequential dependencies within the input prompt embeddings and acquires a new mapping for the soft prompts.
  • Figure 1: The impact of different context length on average Accuracy on Base, New and H over 8 datasets using PRE model.
  • Figure 2: Illustration of one-layer Bidirectional LSTM architecture used in Prompt Encoder.
  • Figure 2: Comparison with vs without prompt initialization on PRE on average Accuracy on Base, New and H over 8 datasets using PRE model.
  • Figure 3: The impact of different reparameterizing encoder network architectures on average Accuracy on Base, New and H over 8 datasets on the 16-shot setting.
  • ...and 3 more figures