Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation

Jianyu Zhang, Li Zhang, Shijian Li

TL;DR

The paper tackles open-vocabulary semantic segmentation (OVSS) by bridging image-level semantics and pixel-level understanding through Feature Pyramid Tokenization (PAT), which learns multi-resolution, semantically rich tokens from pretrained VLM feature pyramids using learnable codebooks. PAT employs decoupled pixel and semantic branches connected by a shared decoder, enabling both reconstruction and open-vocabulary segmentation with efficient parameter usage. Empirical results on COCO-Stuff, Pascal Context, and ADE20K demonstrate competitive performance, and ablations confirm the importance of multi-resolution clustering, decoupled learning, and shared decoding. The approach offers a practical path to leveraging VLMs for open-vocabulary segmentation, with interpretable tokenization and potential applicability to related vision-language tasks.

Abstract

Visual understanding is often approached at three levels of granularity: image, patch, and pixel. Visual tokenization, trained by self-supervised reconstructive learning, compresses visual data with a patch-level codebook at marginal information loss, but the resulting visual tokens carry no semantic meaning. Open-vocabulary semantic segmentation benefits from evolving vision-language models (VLMs) with strong image-level zero-shot capability, but transferring image-level understanding to the pixel level remains a pressing challenge. In this paper, we treat segmentation as tokenizing pixels and study unified perceptual and semantic token compression across all levels of granularity, which in turn facilitates open-vocabulary semantic segmentation. Following the cognitive process of pretrained VLMs, in which low-level features are progressively composed into high-level semantics, we propose Feature Pyramid Tokenization (PAT), which clusters and represents multi-resolution features with learnable codebooks and then decodes them by jointly learning pixel reconstruction and semantic segmentation. We design loosely coupled pixel and semantic learning branches. The pixel branch simulates bottom-up composition and top-down visualization of codebook tokens, while the semantic branch collectively fuses the hierarchical codebooks as auxiliary segmentation guidance. Our experiments show that PAT enhances the semantic intuition of the VLM feature pyramid, improves performance over the baseline segmentation model, and achieves competitive performance on open-vocabulary semantic segmentation benchmarks. Our model is parameter-efficient for VLM integration and flexible enough for independent tokenization. We hope to inspire work not only on improving segmentation but also on the use of semantic visual tokens.
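
To make the idea of representing a multi-resolution feature pyramid with learnable codebooks more concrete, here is a minimal NumPy sketch. It assumes plain nearest-neighbor vector quantization with one independent codebook per pyramid stage; the function name `quantize`, the feature shapes, and the codebook sizes are all hypothetical, and the paper's actual clustering and training procedure may differ.

```python
import numpy as np

def quantize(features, codebook):
    """Assign each feature vector to its nearest codebook entry.

    features: (H, W, D) feature map; codebook: (K, D) entries.
    Returns (token indices of shape (H, W), quantized map of shape (H, W, D)).
    """
    h, w, d = features.shape
    flat = features.reshape(-1, d)                       # (H*W, D)
    # Squared Euclidean distance between every feature and every code.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)                           # (H*W,)
    return idx.reshape(h, w), codebook[idx].reshape(h, w, d)

rng = np.random.default_rng(0)
# Hypothetical three-stage pyramid: coarser resolution at later stages.
pyramid = [rng.normal(size=(s, s, 8)) for s in (16, 8, 4)]
# One codebook per stage (sizes are illustrative, not from the paper).
codebooks = [rng.normal(size=(k, 8)) for k in (32, 16, 8)]

tokens = [quantize(f, c) for f, c in zip(pyramid, codebooks)]
for (idx, quant), c in zip(tokens, codebooks):
    print(idx.shape, quant.shape, len(np.unique(idx)), "of", len(c), "codes used")
```

In a trained model the codebook entries would be learned jointly with the reconstruction and segmentation objectives; here they are random, so the sketch only illustrates the lookup step.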

Paper Structure

This paper contains 17 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Open-vocabulary segmentation example. After learning pyramid tokenization through segmentation and reconstruction, the tokenized feature pyramid (row 2) shows much stronger semantic intuition than the original pretrained representation (row 1) at every stage. Semantic concepts are composed from Early low-level colors and edges, through Mid/Late parts, structures, and textures, to the final Latent objects. In row 3, PAT shows features clustered into meta-semantic tokens that are comprehensive and disentangled.
  • Figure 2: PAT concept. The gap between image-level and pixel-level understanding makes it difficult to use a VLM for segmentation (i.e., Pixel-Label mapping). PAT clusters and tokenizes the feature pyramid into abstract Tokens, then progressively refines semantics to the pixel level based on the easily learned Label-Token mapping. The Tokens become the patch-level bridge between Labels and Pixels.
  • Figure 3: PAT architecture. The global tokens from the Side ViT focus on CLIP-aware semantic concepts, while the local tokens (in the PAT VQ modules) learn cluster-oriented meta semantics (MSMFormer). The decoupled global and local tokens are mutually guided by reconstruction and segmentation, as described in Fig. 4.
  • Figure 4: Mid-stage PAT VQ module. The module decouples semantic learning (Global Tokens with Local Tokens and Side Features) from pixel decoding (Local Tokens with CLIP Features and Pixel Residual).
  • Figure 5: Accumulated improvements over the SAN baseline with different VLMs.
  • ...and 1 more figure
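
The Label-Token-Pixel bridging described for Figure 2 can be illustrated with a small sketch. It assumes that each token has an embedding comparable to label (text) embeddings via cosine similarity, and that every spatial location already carries a token index; the function `label_pixels_via_tokens` and all values below are hypothetical and are not taken from the paper.

```python
import numpy as np

def label_pixels_via_tokens(token_index_map, token_embeddings, label_embeddings):
    """Map labels -> tokens -> pixels.

    token_index_map: (H, W) integer map of token ids per location.
    token_embeddings: (K, D); label_embeddings: (C, D), e.g. text embeddings.
    Returns an (H, W) map of label ids.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Cosine similarity between every token and every label: (K, C).
    sim = normalize(token_embeddings) @ normalize(label_embeddings).T
    token_to_label = sim.argmax(axis=1)       # best label per token, (K,)
    return token_to_label[token_index_map]    # look up per location, (H, W)

# Toy example: three tokens, two labels, a 2x3 token index map.
token_embeddings = np.array([[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]])
label_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
token_index_map = np.array([[0, 0, 1], [2, 1, 1]])

pixel_labels = label_pixels_via_tokens(token_index_map, token_embeddings, label_embeddings)
print(pixel_labels)
# [[0 0 1]
#  [0 1 1]]
```

Because only K tokens need to be matched against the labels, rather than every pixel, the Label-Token step is cheap; the per-location lookup then carries the decision down to finer resolution.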