Subobject-level Image Tokenization

Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, Pascale Fung

TL;DR

This work addresses a limitation of patch-based image tokenization by introducing subobject-level adaptive tokenization. Its main contribution is the EPOC tokenizer, which combines boundary detection with watershed segmentation to produce panoptic, monosemantic tokens at low computational cost. Intrinsic evaluations show that EPOC's segmentation aligns well with human annotations of visual morphology and is more efficient than SAM-based approaches. Extrinsic evaluations show that subobject tokenization, EPOC in particular, helps vision-language models converge faster and generalize better across multiple datasets while using fewer visual tokens and remaining robust to token reduction. The results point to the practical value of adaptive segmentation for scalable image understanding and captioning, and leave room for further improvements in boundary estimation and downstream integration.

Abstract

Patch-based image tokenization ignores the morphology of the visual world, limiting effective and efficient learning of image understanding. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation and explore several approaches, including superpixel, SAM, and a proposed Efficient and PanOptiC (EPOC) image tokenizer. Our EPOC combines boundary detection -- a simple task that can be handled well by a compact model -- with watershed segmentation, which inherently guarantees no pixels are left unsegmented. Intrinsic evaluations across 5 datasets demonstrate that EPOC's segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we designed a token embedding that handles arbitrary-shaped tokens, and trained VLMs with different tokenizers on 4 datasets of object recognition and detailed captioning. The results reveal that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
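The boundary-plus-watershed idea described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the boundary probability map is supplied by hand instead of being predicted by a model, and the function names (`neighbors`, `label_seeds`, `watershed`), the 4-connectivity, and the tie-breaking rule are all assumptions made for illustration. Pixels below a threshold `t` form seed regions, and a priority queue then "floods" the remaining pixels in order of increasing boundary probability, so every pixel ends up with a label.

```python
import heapq
from collections import deque


def neighbors(y, x, h, w):
    """Yield the 4-connected neighbors of (y, x) that lie inside the grid."""
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if 0 <= ny < h and 0 <= nx < w:
            yield ny, nx


def label_seeds(prob, t):
    """Label connected regions with boundary probability below t (the basins)."""
    h, w = len(prob), len(prob[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if prob[y][x] < t and labels[y][x] == 0:
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:  # breadth-first fill of one basin
                    cy, cx = queue.popleft()
                    for ny, nx in neighbors(cy, cx, h, w):
                        if prob[ny][nx] < t and labels[ny][nx] == 0:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


def watershed(prob, t):
    """Grow seed regions over prob, lowest values first, until all pixels are labeled."""
    h, w = len(prob), len(prob[0])
    labels, count = label_seeds(prob, t)
    if count == 0:  # assumption: with no basin, treat the image as one segment
        return [[1] * w for _ in range(h)], 1
    heap, order = [], 0  # 'order' breaks ties in insertion order
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                for ny, nx in neighbors(y, x, h, w):
                    if labels[ny][nx] == 0:
                        heapq.heappush(heap, (prob[ny][nx], order, ny, nx, labels[y][x]))
                        order += 1
    while heap:
        _, _, y, x, lab = heapq.heappop(heap)
        if labels[y][x]:
            continue  # already claimed by a region that reached it earlier
        labels[y][x] = lab
        for ny, nx in neighbors(y, x, h, w):
            if labels[ny][nx] == 0:
                heapq.heappush(heap, (prob[ny][nx], order, ny, nx, lab))
                order += 1
    return labels, count


# Toy boundary map: a vertical high-probability ridge separates two flat basins.
prob = [[0.1, 0.2, 0.9, 0.2, 0.1] for _ in range(3)]
labels, n_tokens = watershed(prob, t=0.5)
```

In this variant, ridge pixels are absorbed by whichever neighboring region reaches them first, which is one simple way to get complete coverage. For real images a library routine such as `skimage.segmentation.watershed` would be the more practical choice.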

Paper Structure (20 sections, 15 figures, 1 table)

Figures (15)

  • Figure 1: Comparing SAM and our EPOC on SA-1B images. Because SAM decodes each mask independently, it often leaves thin gaps between segments or leaves background regions unsegmented. Our EPOC inherently guarantees complete coverage while also improving computational efficiency.
  • Figure 2: Proposed EPOC. A boundary probability map $\mathbf{P} \in [0, 1]^{H \times W}$ is predicted from the input image $\mathbf{X}\in \mathbb{R}^{H\times W \times 3}$ and treated as a topographical surface. The watershed segmentation begins by identifying basins in $\mathbf{P}$ as seed regions (labeled in different colors) with a threshold $t$. A "flooding" process then progresses until the entire $\mathbf{P}$ is submerged. When seed regions meet during flooding, "watersheds" are formed to separate them.
  • Figure 3: Intrinsic evaluation dataset examples and token segmentation results. Object-level tokenization based on panoptic segmentation suffers from the out-of-vocabulary problem. Superpixel segmentation relies on bottom-up pixel grouping, which limits its ability to capture underlying structures. The SAM model and its variants generally provide reasonable token segmentation. The quality and style of the segmentation generated by our EPOC closely match those of SAM, while using a significantly smaller model.
  • Figure 4: Intrinsic evaluation of token segmentation. Connected dots represent the same model at different sizes or with different hyperparameters. Top: We measure the alignment between token segmentation and semantic annotations with boundary precision and recall. Our proposed EPOC achieves Pareto optimality compared to SAM ViT-B models and matches the performance of FastSAM and MobileSAMv2. Bottom: All subobject-level methods demonstrate clear advantages over object-level tokenization (in maximum achievable monosemanticity score) and static patch-based tokenization (in token efficiency).
  • Figure 5: Extrinsic evaluation of token segmentation. (a-d): Adaptive token segmentation shows a clear advantage over patch tokenization, with subobject-level SLIC and EPOC reaching lower perplexity than object-level tokenizers. (e): With VAE embeddings, which are less semantically expressive, the patch-based model fails to converge, while the adaptive tokenizers still train well.
  • ...and 10 more figures
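
The boundary precision and recall mentioned in the Figure 4 caption can be sketched as follows. The paper's exact matching protocol is not given in this excerpt, so the boundary definition (a pixel whose right or bottom neighbor carries a different label), the square tolerance window `tol`, and the helper names are assumptions for illustration only. Precision is the fraction of predicted boundary pixels that lie near an annotated boundary, and recall is the fraction of annotated boundary pixels that lie near a predicted one.

```python
def boundary_pixels(labels):
    """Return the set of pixels whose right or bottom neighbor has a different label."""
    h, w = len(labels), len(labels[0])
    points = set()
    for y in range(h):
        for x in range(w):
            right_differs = x + 1 < w and labels[y][x] != labels[y][x + 1]
            below_differs = y + 1 < h and labels[y][x] != labels[y + 1][x]
            if right_differs or below_differs:
                points.add((y, x))
    return points


def matched_fraction(source, reference, tol):
    """Fraction of source pixels within tol (per axis) of some reference pixel."""
    if not source:
        return 0.0  # assumed convention when there are no boundary pixels
    hits = sum(
        1
        for (y, x) in source
        if any(abs(y - ry) <= tol and abs(x - rx) <= tol for ry, rx in reference)
    )
    return hits / len(source)


def boundary_precision_recall(pred, gt, tol=1):
    """Compare predicted token boundaries against annotated boundaries."""
    pred_b, gt_b = boundary_pixels(pred), boundary_pixels(gt)
    precision = matched_fraction(pred_b, gt_b, tol)
    recall = matched_fraction(gt_b, pred_b, tol)
    return precision, recall


# Annotation: two regions split vertically. Prediction: the same split plus an
# extra horizontal cut, i.e. an over-segmentation into four tokens.
gt = [[1 if x < 3 else 2 for x in range(6)] for _ in range(4)]
pred = [[1 + 2 * (y >= 2) + (x >= 3) for x in range(6)] for y in range(4)]
precision, recall = boundary_precision_recall(pred, gt, tol=0)
```

In this toy case the over-segmented prediction recovers every annotated boundary pixel, so recall is 1.0, while the extra cut lowers precision. This is the kind of trade-off a precision-recall plot over tokenizer sizes and hyperparameters exposes.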