Cascade Prompt Learning for Vision-Language Model Adaptation

Ge Wu, Xin Zhang, Zheng Li, Zhaowei Chen, Jiajun Liang, Jian Yang, Xiang Li

TL;DR

CasPL introduces a two-phase cascade of prompts for vision-language models to decouple domain-general knowledge from task-specific adaptation. The boosting phase distills broad priors from a larger teacher using unlabeled data, while the adapting phase cascades with frozen boosting prompts to tailor performance on downstream tasks, functioning as a plug-in for existing prompt methods. Empirically, CasPL yields consistent improvements across 11 datasets in base-to-novel generalization, domain generalization, and few-shot settings, outperforming state-of-the-art methods with minimal inference overhead. The approach enables efficient deployment of smaller VLMs in resource-constrained environments and offers a flexible, general framework for enhancing prompt learning in vision-language systems.

Abstract

Prompt learning has emerged as an effective approach to enhance the performance of Vision-Language Models (VLMs) like CLIP on downstream tasks. However, current learnable prompt tokens are primarily used for a single phase, adapting to the task (i.e., the adapting prompt), which easily leads to overfitting. In this work, we propose a novel Cascade Prompt Learning (CasPL) framework that enables prompt learning to serve both generic and specific expertise (i.e., boosting and adapting prompts) simultaneously. Specifically, CasPL is a new learning paradigm comprising two distinct phases of learnable prompts. The first, the boosting prompt, is crafted to extract domain-general knowledge from a larger, senior CLIP teacher model by aligning the student's predicted logits with the teacher's on extensive unlabeled domain images. The second, the adapting prompt, is then cascaded with the frozen first set and fine-tuned on the downstream tasks, following the approaches employed in prior research. In this manner, CasPL can effectively capture domain-general and task-specific representations in explicitly separate, progressive groups of prompts, thus potentially alleviating overfitting in the target domain. Notably, CasPL serves as a plug-and-play module that can be integrated into any existing prompt learning approach. CasPL achieves a significantly better balance between performance and inference speed, which is especially beneficial for deploying smaller VLMs in resource-constrained environments. Compared to the previous state-of-the-art method PromptSRC, CasPL shows an average improvement of 1.85% for base classes, 3.44% for novel classes, and 2.72% for the harmonic mean over 11 image classification datasets. Code is publicly available at: https://github.com/megvii-research/CasPL.
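
To make the logit-alignment step of the boosting phase concrete, here is a minimal sketch assuming the standard temperature-scaled knowledge-distillation objective (KL divergence between softened teacher and student distributions, scaled by the squared temperature). The abstract only states that the predicted logits are aligned, so the exact loss form, the function names `log_softmax` and `distillation_loss`, and the example logits are assumptions for illustration, not the authors' implementation. Note that no class labels are required, which is why unlabeled images suffice.

```python
import math
from typing import List


def log_softmax(logits: List[float], temperature: float) -> List[float]:
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: the usual T^2 scaling from standard knowledge distillation
    is applied; the paper's own equation is not reproduced here.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return kl * temperature ** 2


if __name__ == "__main__":
    # Hypothetical per-class logits for one unlabeled image.
    teacher = [4.0, 1.0, 0.5]
    student = [2.0, 1.5, 1.0]
    print("loss before alignment:", distillation_loss(student, teacher, 2.0))
    print("loss when aligned:    ", distillation_loss(teacher, teacher, 2.0))
```

In a sketch like this, only the boosting prompt tokens of the student would receive gradients from the loss; the teacher and the student backbone stay frozen.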

Paper Structure

This paper contains 18 sections, 2 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Comparison of CasPL with previous prompt learning methods. (a) Previous methods adopt single-phase prompting to adapt to domain datasets. (b) CasPL introduces cascaded prompts with distinct functions, consisting of a boosting prompt phase and an adapting prompt phase. (c) Performance (HM) of previous prompt learning methods with and without CasPL on base-to-novel tasks. Results are averaged over 11 datasets.
  • Figure 1: Ablation study on the number of training epochs for the first phase (left) and the choice of temperature hyperparameter in Eq. \ref{eq_kd} (right), based on the DTD dataset.
  • Figure 2: An overview of our proposed CasPL framework. (a) We utilize a set of boosting prompts to enable the student CLIP model to extract general domain knowledge from the teacher CLIP model, leveraging an extensive amount of unlabeled domain data. (b) The boosting prompt can be seamlessly incorporated into existing related work as a plug-in. Here, we exemplify this integration with PromptSRC, where frozen boosting prompts are cascaded with learnable adapting prompts without altering any loss function. Further details regarding adaptations to other methods (e.g., CoOp [zhou2022learning], CoCoOp [zhou2022conditional], MaPLe [khattak2023maple]) are provided in the Appendix.
  • Figure 2: The detail of CasPL for previous methods. (a) CoOp [zhou2022learning] employs multiple layers of text-image boosting prompts and a single layer of text adapting prompts. (b) CoCoOp [zhou2022conditional] utilizes multiple layers of text-image boosting prompts and a single layer of modal blending adapting prompts. (c) MaPLe [khattak2023maple] uses multiple layers of text-image boosting prompts and multiple layers of multi-modal adapting prompts.
  • Figure 3: CasPL performance comparison in a few-shot image recognition setting. Based on PromptSRC, CasPL achieves the highest performance improvement across all settings. These results emphasize the role of the initial boosting prompt of CasPL in extracting domain generalization capabilities from the senior larger CLIP.
  • ...and 2 more figures
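
Figure 1(c) and the abstract report the harmonic mean (HM) of base-class and novel-class accuracy. As a reminder of how that metric behaves, the following small sketch computes it; the function name `harmonic_mean` and the accuracy values are hypothetical and are not results from the paper.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy (same units)."""
    total = base_acc + novel_acc
    if total == 0:
        return 0.0  # avoid division by zero when both accuracies are zero
    return 2.0 * base_acc * novel_acc / total


if __name__ == "__main__":
    # Hypothetical accuracies in percent, chosen only for illustration.
    base, novel = 80.0, 60.0
    hm = harmonic_mean(base, novel)
    arithmetic = (base + novel) / 2.0
    print(f"HM = {hm:.2f}, arithmetic mean = {arithmetic:.2f}")
```

Because the harmonic mean is pulled toward the lower of the two values, a method cannot score well on HM by improving base-class accuracy alone, which is why it is the usual summary metric for base-to-novel generalization.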