Table of Contents
Fetching ...

Tokenize Anything via Prompting

Ting Pan, Lulu Tang, Xinlong Wang, Shiguang Shan

TL;DR

We address the need for a single vision system that can simultaneously localize, recognize, and describe arbitrary regions. The authors introduce TAP, a promptable tokenizer that augments SAM with a semantic token per mask and leverages SemanticSA-1B together with EVA-CLIP priors to learn region-level semantics via concept prediction and promptable segmentation. The model achieves state-of-the-art open-world instance classification on LVIS (e.g., strong zero-shot performance) and sets a new CIDEr record for region captioning on Visual Genome (CIDEr = 164.7) with a compact text decoder, while maintaining competitive segmentation quality. Key contributions include a unified data-and-model framework (SemanticSA-1B), a joint pre-training objective combining $ abla \,\mathcal{L}_{\text{concept}}$ and $\mathcal{L}_{\text{seg}}$, and a lightweight causal decoder enabling region captions without large LLMs. The work demonstrates that promptable region-level representations can generalize across segmentation, recognition, and captioning, offering a versatile, region-wise tokenizer with practical impact for open-world perception tasks, to be leveraged in vision-language systems and downstream reasoning.

Abstract

We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting. To achieve this, we train a generalizable model with massive segmentation masks, \eg, SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters. Specifically, we construct a promptable image decoder by adding a semantic token to each mask token. The semantic token is responsible for learning the semantic priors in a predefined concept space. Through joint optimization of segmentation on mask tokens and concept prediction on semantic tokens, our model exhibits strong regional recognition and localization capabilities. For example, an additional 38M-parameter causal text decoder trained from scratch sets a new record with a CIDEr score of 164.7 on the Visual Genome region captioning task. We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context for a broad range of visual perception tasks. Code and models are available at {\footnotesize \url{https://github.com/baaivision/tokenize-anything}}.

Tokenize Anything via Prompting

TL;DR

We address the need for a single vision system that can simultaneously localize, recognize, and describe arbitrary regions. The authors introduce TAP, a promptable tokenizer that augments SAM with a semantic token per mask and leverages SemanticSA-1B together with EVA-CLIP priors to learn region-level semantics via concept prediction and promptable segmentation. The model achieves state-of-the-art open-world instance classification on LVIS (e.g., strong zero-shot performance) and sets a new CIDEr record for region captioning on Visual Genome (CIDEr = 164.7) with a compact text decoder, while maintaining competitive segmentation quality. Key contributions include a unified data-and-model framework (SemanticSA-1B), a joint pre-training objective combining and , and a lightweight causal decoder enabling region captions without large LLMs. The work demonstrates that promptable region-level representations can generalize across segmentation, recognition, and captioning, offering a versatile, region-wise tokenizer with practical impact for open-world perception tasks, to be leveraged in vision-language systems and downstream reasoning.

Abstract

We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting. To achieve this, we train a generalizable model with massive segmentation masks, \eg, SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters. Specifically, we construct a promptable image decoder by adding a semantic token to each mask token. The semantic token is responsible for learning the semantic priors in a predefined concept space. Through joint optimization of segmentation on mask tokens and concept prediction on semantic tokens, our model exhibits strong regional recognition and localization capabilities. For example, an additional 38M-parameter causal text decoder trained from scratch sets a new record with a CIDEr score of 164.7 on the Visual Genome region captioning task. We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context for a broad range of visual perception tasks. Code and models are available at {\footnotesize \url{https://github.com/baaivision/tokenize-anything}}.
Paper Structure (47 sections, 3 equations, 9 figures, 7 tables, 1 algorithm)

This paper contains 47 sections, 3 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: TAP is a unified and promptable foundation model capable of simultaneously segmenting, recognizing, and captioning arbitrary regions, with flexible visual prompts (point, box and sketch). Following SAM sam, we upgrade its mask decoder to be a versatile image decoder by adding one semantic token for each predicted mask. The model is trained with exhaustive segmentation masks sourced from SA-1B, coupled with semantic priors from a pre-trained EVA-CLIP sun2023evaclip with 5 billion parameters.
  • Figure 2: TAP accepts flexible prompts and outputs mask, category and caption at once.
  • Figure 3: Overview of TAP. a) Building upon SAM's architecture, we enhance the mask decoder to a generic image decoder, adding an additional semantic token [S] to each predicted mask. b) Our model is pre-trained on SemanticSA-1B, jointly optimized for concept prediction and promptable segmentation. c) Subsequently, the pre-trained promptable tokenizer (in dotted box) is employed for region captioning.
  • Figure 4: Promptable captioning. Semantic token is used to prompt text generation.
  • Figure 5: Visualization of understanding open-world knowledge.
  • ...and 4 more figures