Prompt Tuning for CLIP on the Pretrained Manifold

Xi Yang, Yuanrong Xu, Weigang Zhang, Guangming Lu, David Zhang, Jie Wen

TL;DR

ManiPT applies cosine consistency constraints in both the text and image modalities to keep learned representations within the pretrained geometric neighborhood. It also adds a structural bias that enforces incremental corrections, steering adaptation along transferable directions and reducing reliance on shortcut learning.

Abstract

Prompt tuning introduces learnable prompt vectors that adapt pretrained vision-language models to downstream tasks in a parameter-efficient manner. However, under limited supervision, prompt tuning alters pretrained representations and drives downstream features away from the pretrained manifold toward directions that are unfavorable for transfer. This drift degrades generalization. To address this limitation, we propose ManiPT, a framework that performs prompt tuning on the pretrained manifold. ManiPT introduces cosine consistency constraints in both the text and image modalities to confine the learned representations within the pretrained geometric neighborhood. Furthermore, we introduce a structural bias that enforces incremental corrections, guiding the adaptation along transferable directions to mitigate reliance on shortcut learning. From a theoretical perspective, ManiPT alleviates overfitting tendencies under limited data. Our experiments cover four downstream settings: unseen-class generalization, few-shot classification, cross-dataset transfer, and domain generalization. Across these settings, ManiPT achieves higher average performance than baseline methods. Notably, ManiPT provides an explicit perspective on how prompt tuning overfits under limited supervision.
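
To make the idea of a cosine consistency constraint concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the constraint can be written as the mean of one minus the cosine similarity between prompt-adapted features and the corresponding frozen pretrained features. The function names (`l2_normalize`, `cosine_consistency_loss`), the feature shapes, and the noise levels are hypothetical and chosen only for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm (eps guards against division by zero)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def cosine_consistency_loss(adapted, pretrained):
    """Mean of (1 - cosine similarity) between matching rows.

    Assumption: this is one plausible form of a cosine consistency penalty;
    the paper's exact loss may differ.
    """
    a = l2_normalize(adapted)
    p = l2_normalize(pretrained)
    cos = np.sum(a * p, axis=-1)
    return float(np.mean(1.0 - cos))

# Hypothetical features: 4 samples with embedding dimension 8.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(4, 8))
small_drift = pretrained + 0.05 * rng.normal(size=(4, 8))
large_drift = pretrained + 2.0 * rng.normal(size=(4, 8))

loss_same = cosine_consistency_loss(pretrained, pretrained)
loss_small = cosine_consistency_loss(small_drift, pretrained)
loss_large = cosine_consistency_loss(large_drift, pretrained)
print(loss_same, loss_small, loss_large)
```

Under this sketch the penalty is zero when adapted and pretrained features point in the same direction and grows as they diverge, so adding it to the task loss would discourage drift in direction while ignoring changes in feature norm.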

Paper Structure

This paper contains 45 sections, 6 theorems, 62 equations, 12 figures, 19 tables, 2 algorithms.

Key Result

Lemma 4.2

Let $d$ be the embedding dimension and let $\mathbb S^{d-1}=\{\mathbf{z}\in\mathbb R^d:\|\mathbf{z}\|_2=1\}$ be the unit sphere. Let $\boldsymbol{\varphi}\in\mathbb S^{d-1}$ and $\boldsymbol{\psi}\in\mathbb S^{d-1}$ satisfy $\boldsymbol{\varphi}+\boldsymbol{\psi}\neq 0$, which is equivalent to $\lan
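
The lemma statement is cut off in this listing, so the sketch below does not attempt to reproduce its conclusion. It only illustrates a standard geometric fact about the setup it describes: for unit vectors $\boldsymbol{\varphi}$ and $\boldsymbol{\psi}$ with $\boldsymbol{\varphi}+\boldsymbol{\psi}\neq 0$, the normalized sum bisects the angle between them, so the fused vector lies at half the original angle from either input. The helper names (`unit`, `angle`, `additive_fusion`) and the dimension 16 are hypothetical, and reading the normalized sum as the paper's additive fusion is an assumption.

```python
import numpy as np

def unit(v):
    """Project a nonzero vector onto the unit sphere."""
    return v / np.linalg.norm(v)

def angle(u, v):
    """Angle in radians between two unit vectors."""
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def additive_fusion(phi, psi):
    """Normalized sum of two unit vectors.

    Assumption: this is an illustrative reading of 'additive fusion';
    it is undefined when phi + psi = 0 (antipodal inputs).
    """
    s = phi + psi
    norm = np.linalg.norm(s)
    if norm < 1e-12:
        raise ValueError("phi + psi must be nonzero")
    return s / norm

rng = np.random.default_rng(0)
phi = unit(rng.normal(size=16))   # hypothetical first unit feature
psi = unit(rng.normal(size=16))   # hypothetical second unit feature
fused = additive_fusion(phi, psi)

angle_before = angle(phi, psi)    # angle between the two inputs
angle_after = angle(phi, fused)   # angle between phi and the fused vector
print(angle_before, angle_after)
```

Because the fused vector sits on the bisector, its angular distance to each input is half the distance between the inputs, which is one way to see why summing before normalizing yields a smaller, incremental correction.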

Figures (12)

  • Figure 1: Manifold drift and manifold preservation on the CLIP feature space. (a) Prompt tuning. Under limited supervision, prompt tuning drives adapted representations away from the pretrained manifold. (b) ManiPT. The proposed method constrains adapted representations to stay close to the pretrained manifold.
  • Figure 2: Overview of ManiPT. The framework enriches class descriptions with an LLM, constructs a text feature bank as semantic prototypes, and applies cosine consistency constraints together with a structural bias to keep prompt tuning near the pretrained manifold.
  • Figure 3: Few-shot classification performance averaged over 11 datasets under 1, 2, 4, 8, and 16 shot settings. ManiPT consistently outperforms baseline methods across all shot numbers, with especially clear gains in the 1-shot and 2-shot regimes.
  • Figure 4: Quantitative analysis of manifold drift using PCA. We compare the distance between prompt-adapted features and pretrained features for different methods.
  • Figure 5: Sensitivity analysis of the consistency constraint weight.
  • ...and 7 more figures

Theorems & Definitions (9)

  • Definition 4.1: Empirical Risk
  • Lemma 4.2: Contraction Induced by Additive Fusion
  • Lemma 4.3
  • Corollary 4.4
  • Lemma 2.1
  • proof
  • Lemma 2.2
  • proof
  • Theorem 2.3