Unified Vision and Language Prompt Learning

Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy

TL;DR

The paper analyzes the limitations of single-modal prompt tuning for vision-language models and shows inconsistent gains across datasets due to intra-class visual variance and inter-class text variance. It introduces Unified Prompt Tuning (UPT), a multimodal approach that learns a shared set of prompts transformed by a lightweight self-attention module to generate modality-specific prompts for both text and image encoders, with the backbones frozen. Across 11 datasets and settings including few-shot and domain generalization, UPT outperforms text-only and visual-only prompt baselines, often by meaningful margins, and exhibits strong cross-modal alignment. The work underscores the value of multimodal prompt learning and offers a practical, efficient method for adapting large VL models, with public code to facilitate future research.

Abstract

Prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a small number of parameters in a model's input space, has become a trend in the vision community since the emergence of large vision-language models like CLIP. We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning. A major finding is that none of the unimodal prompt tuning methods performs consistently well: text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances. To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities. Extensive experiments on over 11 vision datasets show that UPT achieves a better trade-off than the unimodal counterparts on few-shot learning benchmarks, as well as on domain generalization benchmarks. Code and models will be released to facilitate future research.
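
The mechanism summarized in the TL;DR (a shared set of prompts passed through a lightweight self-attention module, then handed to the text and image encoders) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the single attention head, the residual connection, the shared embedding width for both modalities, and the split-by-position rule are all assumptions made for illustration. In training, only the shared prompts and the attention weights would be updated, while the CLIP backbones stay frozen.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(u, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention plus a residual connection."""
    q, k, v = matmul(u, w_q), matmul(u, w_k), matmul(u, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose K
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    attended = matmul(weights, v)
    # Residual connection (an assumption of this sketch).
    return [[a + b for a, b in zip(r_att, r_u)] for r_att, r_u in zip(attended, u)]


def random_matrix(rows, cols, rng, std=0.02):
    """Small Gaussian initialization; stands in for learnable parameters."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def unified_prompts(n_text, n_visual, dim, seed=0):
    """Build shared prompts, transform them jointly, then split per modality.

    Hypothetical helper: splitting by position and using one width `dim`
    for both modalities are simplifying assumptions.
    """
    rng = random.Random(seed)
    u = random_matrix(n_text + n_visual, dim, rng)  # shared (unified) prompts
    w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))
    transformed = self_attention(u, w_q, w_k, w_v)
    return transformed[:n_text], transformed[n_text:]


text_prompts, visual_prompts = unified_prompts(n_text=4, n_visual=4, dim=8)
```

Because every prompt token attends to every other token before the split, the text-side and image-side prompts are produced jointly rather than by two independent parameter sets, which is the property the unimodal baselines lack.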

Paper Structure (14 sections, 8 equations, 6 figures, 2 tables)

This paper contains 14 sections, 8 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Top: the architecture paradigm of (a) text prompt tuning (zhou2021coop), (b) visual prompt tuning (jia2022visual), and (c) our multi-modal unified prompt tuning (icons in the figure mark learnable and frozen parameters). Bottom: the performance improvements (%) of text prompt tuning (d) and visual prompt tuning (e) over the zero-shot CLIP baseline. We show that the variance of the visual and text features ($x$-axis) affects the improvements ($y$-axis). We project the text/visual features of the dataset (indicated by the dashed arrow) onto a unit sphere to show the variance of the different distributions. Please refer to the appendix for implementation details on how we compute the feature variance.
  • Figure 2: Visualization of input features ${\bm{z}}$ (projected points) and text classifier ${\mathbf{W}}$ (projected lines) on EuroSAT and Flowers102.
  • Figure 3: The architecture of (a) our unified prompt ${\bm{U}}$ that is applied to (b) CLIP text encoder and (c) CLIP image encoder.
  • Figure 4: Main results over 11 datasets under the few-shot learning setting. We report the average accuracy (%) for 1/2/4/8/16 shots over three runs. Overall, the proposed UPT (blue line) shows clear improvements over the zero-shot CLIP and single-modal prompt tuning baselines (CoOp and VPT).
  • Figure 5: Ablation studies on different design choices. (a): jointly train the existing text and visual prompt tuning approaches; (b): shared prompts for all modalities; (c): using two MLP layers to generate the prompts.
  • ...and 1 more figure