Revisiting Prompt Pretraining of Vision-Language Models

Zhenyuan Chen; Lingfeng Yang; Shuo Chen; Zhaowei Chen; Jiajun Liang; Xiang Li

TL;DR

A general framework, Revisiting Prompt Pretraining (RPP), that improves the fitting and generalization ability of prompt pretraining from two aspects, prompt structure and prompt supervision, and produces a more resilient prompt initialization that transfers robustly across diverse visual recognition tasks.

Abstract

Prompt learning is an effective method to customize Vision-Language Models (VLMs) for various downstream tasks, and involves tuning only a few parameters of the input prompt tokens. Recently, prompt pretraining on large-scale datasets (e.g., ImageNet-21K) has played a crucial role in prompt learning for universal visual discrimination. However, we revisit this practice and observe that the limited learnable prompts can underfit given the extensive images used during prompt pretraining, which in turn leads to poor generalization. To address these issues, we propose a general framework termed Revisiting Prompt Pretraining (RPP), which aims to improve fitting and generalization ability from two aspects: prompt structure and prompt supervision. For prompt structure, we break the restriction in common practice where the query, key, and value vectors are derived from a shared learnable prompt token. Instead, we introduce unshared, individual query, key, and value learnable prompts, thereby enhancing the model's fitting capacity through increased parameter diversity. For prompt supervision, we additionally utilize soft labels derived from zero-shot probability predictions provided by a pretrained Contrastive Language Image Pretraining (CLIP) teacher model. These soft labels yield more nuanced and general insights into the inter-class relationships, thereby endowing the pretraining process with better generalization ability. RPP produces a more resilient prompt initialization that transfers more robustly across diverse visual recognition tasks. Experiments across various benchmarks consistently confirm the state-of-the-art (SOTA) performance of our pretrained prompts. Code and models will be made available soon.
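The prompt-supervision idea in the abstract (hard labels plus soft labels from a frozen CLIP teacher) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of a KL-divergence term, the weight `lam`, and the `temperature` parameter are all assumptions made for illustration, and the logits below are made-up numbers rather than CLIP outputs.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Hard-label loss: negative log-probability of the ground-truth class."""
    return -math.log(probs[label])

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student), the soft-label term (assumed form)."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )

def pretraining_loss(student_logits, teacher_logits, label,
                     lam=1.0, temperature=1.0):
    """Hypothetical combined objective: hard-label loss + lam * soft-label loss.

    teacher_logits stand in for the zero-shot predictions of a frozen teacher;
    only the student's prompts would receive gradients in real training.
    """
    hard = cross_entropy(softmax(student_logits), label)
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return hard + lam * soft

if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]   # illustrative student logits for 3 classes
    teacher = [1.5, 1.2, -2.0]   # illustrative teacher (zero-shot) logits
    print(pretraining_loss(student, teacher, label=0, lam=0.5))
```

When the student already matches the teacher's distribution, the soft term vanishes and only the hard-label loss remains; otherwise the soft term pulls the student toward the teacher's view of inter-class similarity.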

Paper Structure (35 sections, 1 theorem, 28 equations, 6 figures, 10 tables)

Introduction
Related Work
Prompt Learning
CLIP Distillation
Method
Revisiting Prompt Pretraining
Preliminaries
Self-Attention Prompt Learning
Prompt Pretraining with Knowledge Distillation
Theoretical Analysis
Experiments
Quantitative Experiments
Prompt Pretraining on ImageNet-21K.
Zero-shot Image Classification.
Few-shot Image Classification.
...and 20 more sections

Key Result

Theorem 1

Assume that $\Theta^{*}$ is the solution to Eq. eqn_optim. Then we have that for any $0<\delta<1$ with probability $1-\delta$, where $X^{*}=\max_{r\in \mathbb{N}_{N}}\left| \mathcal{L}\left( \hat{s}_{r}^{S}\left( \Theta \right), y_{r}^{gt} \right) \right|$, and $B_{\lambda}\to 0$ as $\lambda\to +\infty$.

Figures (6)

Figure 1: Our method outperforms previous SOTA models on a broad range of visual recognition tasks and datasets.
Figure 2: An overview of our proposed pretraining framework. First, we propose the SAPL Text/Image Encoder, which optimizes individual query, key, and value embeddings directly and explicitly (Sec. 4.2). Next, we employ a frozen teacher model to supervise the student model's learnable prompts using a regularization loss from knowledge distillation (Sec. 4.3). Further, we provide theoretical results for the generalization error bound of RPP (Sec. 4.4).
Figure 3: Comparative pretraining experiments on training epoch.
Figure 4: Comparative pretraining experiments on training data amount.
Figure 6: A detailed description of our proposed SAPL prompt structure. (a) Existing methods adopt uni-modal prompting techniques to fine-tune CLIP representations as prompts are learned only in a single branch of CLIP (language or vision). (b) Our SAPL explicitly and directly optimizes the individual query, key, and value embeddings. (c) Detailed description of SAPL at each layer.
...and 1 more figure
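The difference between a shared prompt token and the unshared query, key, and value prompts described for SAPL can be sketched in a few lines. This is only a toy single-head attention under stated assumptions: the function names, the choice to append prompt rows directly to the projected Q, K, and V sequences, and all shapes are hypothetical and are not taken from the paper's code.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Plain scaled dot-product attention over lists of vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return out

def shared_prompt_attention(tokens, prompts, Wq, Wk, Wv):
    """Common practice: each prompt token is projected by Wq, Wk, Wv,
    so its query, key, and value all derive from one learnable vector."""
    seq = tokens + prompts
    Q = [matvec(Wq, x) for x in seq]
    K = [matvec(Wk, x) for x in seq]
    V = [matvec(Wv, x) for x in seq]
    return attention(Q, K, V)

def unshared_prompt_attention(tokens, prompt_q, prompt_k, prompt_v,
                              Wq, Wk, Wv):
    """Assumed SAPL-style variant: separate learnable query, key, and value
    prompt rows are appended directly, bypassing the shared projection."""
    Q = [matvec(Wq, x) for x in tokens] + prompt_q
    K = [matvec(Wk, x) for x in tokens] + prompt_k
    V = [matvec(Wv, x) for x in tokens] + prompt_v
    return attention(Q, K, V)
```

Under these assumptions the shared scheme is a special case of the unshared one: setting `prompt_q`, `prompt_k`, and `prompt_v` to the projections of a single prompt reproduces the shared output, while free choices of the three give the extra parameter diversity the abstract refers to.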

Theorems & Definitions (3)

Theorem 1
proof
proof
