Unified Vision and Language Prompt Learning
Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy
TL;DR
The paper studies the limits of single-modal prompt tuning for vision-language models such as CLIP and finds that neither text-only nor visual-only prompt tuning performs consistently across datasets: text prompt tuning struggles when intra-class visual variance is high, and visual prompt tuning struggles when inter-class variance is low. It introduces Unified Prompt Tuning (UPT), a multimodal approach that learns a shared set of prompts and passes them through a lightweight self-attention module to generate modality-specific prompts for both the text and image encoders, while the backbones stay frozen. On over 11 vision datasets, covering few-shot learning and domain generalization, UPT achieves a better trade-off than the text-only and visual-only prompt baselines and exhibits strong cross-modal alignment. The work underscores the value of multimodal prompt learning and offers a practical, parameter-efficient method for adapting large vision-language models; the authors state that code and models will be released to facilitate future research.
Abstract
Prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a small number of parameters in a model's input space, has become a trend in the vision community since the emergence of large vision-language models like CLIP. We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning. A major finding is that none of the unimodal prompt tuning methods performs consistently well: text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances. To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities. Extensive experiments on over 11 vision datasets show that UPT achieves a better trade-off than the unimodal counterparts on few-shot learning benchmarks, as well as on domain generalization benchmarks. Code and models will be released to facilitate future research.
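The idea of a tiny network that jointly produces prompts for both modalities can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `UnifiedPromptSketch`, all dimensions, the single-head attention, and the two linear projections are assumptions made for illustration, and only the forward pass is shown. In practice the shared prompts and the small attention module would be the only trained parameters, with the text and image encoders frozen.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class UnifiedPromptSketch:
    """Hypothetical sketch: shared prompts -> self-attention -> per-modality prompts."""

    def __init__(self, n_text=4, n_visual=4, d_shared=32,
                 d_text=16, d_visual=24, seed=0):
        rng = np.random.default_rng(seed)
        self.n_text = n_text
        self.d_shared = d_shared
        # One shared set of prompt vectors (the learnable part in a real setup).
        self.shared = rng.normal(0.0, 0.02, (n_text + n_visual, d_shared))
        # Single-head self-attention weights (assumed form of the "tiny network").
        self.w_q = rng.normal(0.0, 0.02, (d_shared, d_shared))
        self.w_k = rng.normal(0.0, 0.02, (d_shared, d_shared))
        self.w_v = rng.normal(0.0, 0.02, (d_shared, d_shared))
        # Assumed linear maps into each encoder's embedding width.
        self.to_text = rng.normal(0.0, 0.02, (d_shared, d_text))
        self.to_visual = rng.normal(0.0, 0.02, (d_shared, d_visual))

    def forward(self):
        q = self.shared @ self.w_q
        k = self.shared @ self.w_k
        v = self.shared @ self.w_v
        # Every prompt attends to every other prompt, so the text and
        # visual halves exchange information before they are split.
        attn = softmax(q @ k.T / np.sqrt(self.d_shared), axis=-1)
        mixed = self.shared + attn @ v  # residual connection
        text_prompts = mixed[: self.n_text] @ self.to_text
        visual_prompts = mixed[self.n_text:] @ self.to_visual
        return text_prompts, visual_prompts


text_prompts, visual_prompts = UnifiedPromptSketch().forward()
print(text_prompts.shape, visual_prompts.shape)  # (4, 16) (4, 24)
```

The resulting `text_prompts` would be prepended to the class-name token embeddings of the text encoder and `visual_prompts` to the patch embeddings of the image encoder. Because both sets come from the same attended sequence, a gradient from either modality updates the shared prompts, which is the sense in which the prompts are optimized jointly.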
