Tokenize Anything via Prompting
Ting Pan, Lulu Tang, Xinlong Wang, Shiguang Shan
TL;DR
We address the need for a single vision system that can simultaneously localize, recognize, and describe arbitrary regions. The authors introduce TAP, a promptable tokenizer that augments SAM with a semantic token per mask and leverages SemanticSA-1B together with EVA-CLIP priors to learn region-level semantics via concept prediction and promptable segmentation. The model achieves state-of-the-art open-world instance classification on LVIS (e.g., strong zero-shot performance) and sets a new CIDEr record for region captioning on Visual Genome (CIDEr = 164.7) with a compact text decoder, while maintaining competitive segmentation quality. Key contributions include a unified data-and-model framework (SemanticSA-1B), a joint pre-training objective combining $ abla \,\mathcal{L}_{\text{concept}}$ and $\mathcal{L}_{\text{seg}}$, and a lightweight causal decoder enabling region captions without large LLMs. The work demonstrates that promptable region-level representations can generalize across segmentation, recognition, and captioning, offering a versatile, region-wise tokenizer with practical impact for open-world perception tasks, to be leveraged in vision-language systems and downstream reasoning.
Abstract
We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting. To achieve this, we train a generalizable model with massive segmentation masks, \eg, SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters. Specifically, we construct a promptable image decoder by adding a semantic token to each mask token. The semantic token is responsible for learning the semantic priors in a predefined concept space. Through joint optimization of segmentation on mask tokens and concept prediction on semantic tokens, our model exhibits strong regional recognition and localization capabilities. For example, an additional 38M-parameter causal text decoder trained from scratch sets a new record with a CIDEr score of 164.7 on the Visual Genome region captioning task. We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context for a broad range of visual perception tasks. Code and models are available at {\footnotesize \url{https://github.com/baaivision/tokenize-anything}}.
