Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alex Smola, Xu Sun

TL;DR

POMP, the prompt pre-training method proposed in this paper, pre-trains a universal soft prompt on ImageNet-21K to condense semantic information across more than 20,000 classes. It introduces Local Contrast (class-subset sampling) and Local Correction (adaptive margin) to make pre-training memory-efficient and scalable while keeping generalization robust on zero-shot open-vocabulary tasks. Empirically, POMP delivers state-of-the-art results on image classification, open-vocabulary semantic segmentation, and open-vocabulary object detection, and it cuts the GPU memory needed for prompt tuning on ImageNet-21K from 316.4 GB (CoOp) to 15.7 GB. Once pre-trained, the prompt can be plugged directly into diverse downstream tasks for zero-shot use.
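To make the two ingredients above concrete, here is a minimal NumPy sketch of a loss with class-subset sampling and a margin on the negative logits. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature `tau`, and the margin form `-log((K-1)/(N-1))` (K sampled classes out of N) are assumptions of this sketch, and the class features are precomputed here, whereas in practice only the sampled classes would be passed through the text encoder.

```python
import numpy as np

def sample_class_subset(label, num_classes, k, rng):
    """Return k class ids: the ground-truth label first, then k-1 random negatives."""
    negatives = rng.choice(num_classes - 1, size=k - 1, replace=False)
    # Shift ids >= label by one so the ground-truth class is never drawn as a negative.
    negatives = negatives + (negatives >= label)
    return np.concatenate(([label], negatives))

def local_contrast_loss(image_feat, class_feats, label, k, rng, tau=0.07, use_margin=True):
    """Cross-entropy over a sampled subset of k classes instead of all classes."""
    num_classes = class_feats.shape[0]
    subset = sample_class_subset(label, num_classes, k, rng)
    # Similarity logits against the k sampled classes only.
    logits = class_feats[subset] @ image_feat / tau
    if use_margin:
        # Assumed margin form: raises negative logits to offset the easier
        # k-way task; it vanishes when k equals the full class count.
        margin = -np.log((k - 1) / (num_classes - 1))
        logits[1:] += margin
    # Numerically stable cross-entropy; the ground truth sits at index 0.
    logits = logits - logits.max()
    return -(logits[0] - np.log(np.exp(logits).sum()))

# Toy usage with random unit-norm features (hypothetical sizes).
rng = np.random.default_rng(0)
class_feats = rng.normal(size=(200, 16))
class_feats /= np.linalg.norm(class_feats, axis=1, keepdims=True)
loss = local_contrast_loss(class_feats[7], class_feats, label=7, k=20, rng=rng)
print(float(loss))
```

The design point is that the cost of the contrastive step depends on the subset size rather than on the full class count, which is why sampling makes pre-training on a 20k-class dataset tractable.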

Abstract

This work proposes POMP, a prompt pre-training method for vision-language models. Being memory and computation efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts with over twenty-thousand classes. Once pre-trained, the prompt with a strong transferable ability can be directly plugged into a variety of visual recognition tasks including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg). Our code is available at https://github.com/amazon-science/prompt-pretraining.

Paper Structure (35 sections, 11 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: POMP outperforms previous state-of-the-art models on a broad range of visual recognition tasks and datasets.
  • Figure 2: Overview of POMP. POMP pre-trains a soft prompt (learnable) on the ImageNet-21K dataset, which covers a very large number of classes, and then directly transfers the learned prompt (frozen) to downstream datasets for image classification (CLS), object detection (DET), and semantic segmentation (SEG). For DET and SEG, the region and mask proposal networks require pre-training with the POMP prompt on detection and segmentation source data, respectively (see the Appendix).
  • Figure 3: GPU memory overhead (GB) required for prompt tuning on datasets with varying numbers of classes. The memory cost of CoOp on ImageNet-21K is $316.4$ GB, which is generally prohibitive. POMP reduces the cost dramatically to $15.7$ GB with local contrast among the $1000$ sampled classes for optimization.
  • Figure 4: Comparison with state-of-the-art methods on COCO Stuff dataset and Pascal VOC dataset. POMP and ZSSeg share the same mask proposal network and training strategy.
  • Figure 5: $\ell_{\text{align}}$ and $\ell_{\text{uniform}}$ of POMP. For both measures, lower numbers are better. The color of circles and the numbers in the boxes denote the average cross-dataset accuracy over $10$ datasets (higher is better).
  • ...and 4 more figures
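
The two measures named in Figure 5 can be sketched as follows, assuming they follow the commonly used alignment and uniformity formulation on unit-normalized features. The function names, the defaults `alpha=2` and `t=2`, and the choice of which feature pairs count as positives (for example, an image feature and the text feature of its class) are assumptions of this sketch and are not stated in this section.

```python
import numpy as np

def align_loss(x, y, alpha=2):
    """Mean distance (raised to alpha) between paired features; lower is better."""
    return (np.linalg.norm(x - y, axis=1) ** alpha).mean()

def uniform_loss(x, t=2):
    """Log of the mean Gaussian potential over all distinct pairs; lower is better."""
    # Pairwise squared Euclidean distances, shape (n, n).
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    # Keep each unordered pair once (strict upper triangle).
    upper = np.triu_indices(x.shape[0], k=1)
    return np.log(np.exp(-t * sq_dists[upper]).mean())

# Toy usage: random unit vectors standing in for paired features (hypothetical data).
rng = np.random.default_rng(0)
a = rng.normal(size=(32, 8))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b = a + 0.1 * rng.normal(size=a.shape)
b /= np.linalg.norm(b, axis=1, keepdims=True)
print(float(align_loss(a, b)), float(uniform_loss(a)))
```

Under this formulation, a low alignment value means paired features sit close together, and a low uniformity value means the features spread out over the unit hypersphere.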