Progressive Visual Prompt Learning with Contrastive Feature Re-formation

Chen Xu, Yuhan Zhu, Haocheng Shen, Boheng Chen, Yixuan Liao, Xiaoxin Chen, Limin Wang

TL;DR

This work targets prompt-based adaptation of Vision-Language models and introduces Progressive Visual Prompt (ProVP) to foster cross-layer interactions in the image encoder, along with Contrastive Feature Re-formation (Ref) to preserve pre-trained CLIP feature distributions during downstream learning. The combination, named ProVP-Ref, yields strong adaptation and generalization across 11 image datasets, achieving 7/11 state-of-the-art results in few-shot and base-to-new settings and demonstrating notable gains on domains with substantial distribution shifts. Ablation and analysis show the benefits of learning instance-specific prompts and constraining features in the learned space, and the authors also propose a multi-modal extension (ProVP*-Ref) that further improves performance. Overall, the study highlights the viability and advantages of visual prompts in Vision-Language models for robust open-set recognition, with practical implications for adapting large V-L models to diverse downstream tasks.

Abstract

Prompt learning has been proposed as an alternative to fine-tuning for adapting vision-language (V-L) models to downstream tasks. Previous work mainly focuses on text prompts, while visual prompt methods for V-L models remain limited. Existing visual prompt methods suffer from either mediocre performance or an unstable training process, indicating the difficulty of visual prompt learning. In this paper, we propose a new Progressive Visual Prompt (ProVP) structure to strengthen the interactions among prompts of different layers. More importantly, ProVP can effectively propagate the image embeddings to deep layers and behaves partially like an instance-adaptive prompt method. To alleviate generalization deterioration, we further propose a new contrastive feature re-formation, which prevents the prompted visual features from deviating seriously from the fixed CLIP visual feature distribution. Combining both, our method (ProVP-Ref) is evaluated on 11 image benchmark datasets and achieves 7/11 state-of-the-art results in both few-shot and base-to-novel settings. To the best of our knowledge, we are the first to demonstrate that visual prompts in V-L models can outperform previous prompt-based methods on downstream tasks. This also suggests that ProVP-Ref shows the best capability to adapt and to generalize.
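
To make the progressive connection more concrete, the following is a minimal Python sketch and not the authors' implementation. The abstract does not give the exact combination rule, so the sketch assumes an additive mix controlled by a hypothetical weight `alpha`, and it replaces each frozen encoder layer with a toy stand-in. The names `progressive_forward`, `toy_layer`, and `alpha` are illustrative assumptions.

```python
from typing import Callable, List, Optional

Vector = List[float]
Tokens = List[Vector]


def toy_layer(tokens: Tokens) -> Tokens:
    """Toy stand-in for a frozen encoder layer (assumption, not CLIP):
    every token is shifted by the mean of all tokens, so prompt tokens
    and image tokens exchange information, loosely mimicking attention."""
    dim = len(tokens[0])
    mean = [sum(t[j] for t in tokens) / len(tokens) for j in range(dim)]
    return [[x + m for x, m in zip(t, mean)] for t in tokens]


def progressive_forward(
    layer_prompts: List[Tokens],            # one learnable prompt set per layer
    image_tokens: Tokens,                   # embedded image patches
    layer_fn: Callable[[Tokens], Tokens],   # frozen layer (toy stand-in here)
    alpha: float = 0.1,                     # hypothetical mixing weight
) -> Tokens:
    """Run the layers, feeding each layer its new prompt combined with the
    previous layer's prompt outputs (assumed rule: new + alpha * previous)."""
    n_prompt = len(layer_prompts[0])
    prev_prompt_out: Optional[Tokens] = None
    tokens = image_tokens
    for new_prompt in layer_prompts:
        if prev_prompt_out is None:
            prompt_in = new_prompt          # first layer: nothing to carry over
        else:
            prompt_in = [
                [p + alpha * o for p, o in zip(p_vec, o_vec)]
                for p_vec, o_vec in zip(new_prompt, prev_prompt_out)
            ]
        out = layer_fn(prompt_in + tokens)  # prompts are prepended to the tokens
        prev_prompt_out = out[:n_prompt]    # keep prompt outputs for the next layer
        tokens = out[n_prompt:]
    return tokens
```

With `alpha = 0`, every layer receives a fresh prompt and the previous prompt outputs are discarded. With `alpha > 0`, the prompt outputs, which have already interacted with the image tokens, are carried forward, so the prompt input at deeper layers depends on the instance. This is one way to read the abstract's remark that ProVP behaves partially like an instance-adaptive method.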

Paper Structure (20 sections, 9 equations, 10 figures, 8 tables)

This paper contains 20 sections, 9 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Absolute gains of ProVP-Ref over CoCoOp in base-to-new generalization on 8 datasets. ProVP-Ref demonstrates a significant improvement in both base and new categories.
  • Figure 2: Test accuracy curves of VPT-Deep and ProVP during training (results shown for 8-shot learning on ImageNet). VPT-Deep shows serious training instability: its test accuracy drops sharply several times, at some points to nearly 0.
  • Figure 3: An overview of ProVP-Ref. Right: the full pipeline of our approach. It consists of two main parts: Progressive Visual Prompt, in which we retain the outputs of the prompts via a progressive connection, and Contrastive Feature Re-formation, in which we use the frozen image encoder and the original image input to re-form the prompted features so that their representation is more similar to that of pre-trained CLIP. Left: a detailed view of Progressive Visual Prompt, in which new prompts are combined with the output of the preceding embeddings before being sent to the encoder layer.
  • Figure 4: Full comparison of ProVP-Ref with previous methods on few-shot learning. ProVP-Ref shows significant gains on 7/11 datasets and substantially improves the average performance.
  • Figure 5: t-SNE plots of image embeddings for VPT, ProVP, CoCoOp, and ProVP-Ref on two diverse datasets. (Note that CoOp, CoCoOp, and zero-shot CLIP do not learn on the image branch and therefore share the same image feature visualization.) The embeddings of ProVP and ProVP-Ref are evidently more separable. Furthermore, since ProVP-Ref learns from a combination of pre-trained and downstream knowledge, its feature representation is more similar to that of zero-shot CLIP, providing better generalization capability.
  • ...and 5 more figures
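
The re-formation step described in the Figure 3 caption compares prompted features against features of the same images from the frozen encoder. The sketch below shows one plausible form of such a constraint, an InfoNCE-style contrastive loss. It is an assumption for illustration, not the loss defined in the paper; the function names `cosine` and `reformation_loss` and the `temperature` parameter are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def reformation_loss(
    prompted: List[Vector],     # features from the prompted image encoder
    frozen: List[Vector],       # features of the same images, frozen encoder
    temperature: float = 0.1,   # hypothetical temperature
) -> float:
    """InfoNCE-style loss (assumed form): each prompted feature should be
    closer to the frozen feature of its own image than to the frozen
    features of the other images in the batch."""
    total = 0.0
    for i, feat in enumerate(prompted):
        logits = [cosine(feat, ref) / temperature for ref in frozen]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(prompted)
```

Under this assumed form, the loss is small when each prompted feature stays aligned with the frozen feature of its own image and grows as the prompted features drift toward other images' features. A term of this kind would be added to the usual classification objective.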