
Perceptual Group Tokenizer: Building Perception with Iterative Grouping

Zhiwei Deng, Ting Chen, Yang Li

TL;DR

The paper tackles whether perceptual grouping can underlie a strong self-supervised visual backbone. It introduces Perceptual Group Tokenizer (PGT), which iteratively binds input tokens to multiple group tokens via multi-grouping heads to form context-rich representations, trained with a moving-average teacher loss. On ImageNet-1K, PGT achieves competitive results (80.3% top-1 with linear probe) and offers adaptive computation and interpretability, with successful transfer to ADE20k segmentation and insightful visualizations of group-token interactions. While effective, the approach incurs substantial computation due to iterative grouping, suggesting avenues for more efficient or closed-form grouping in future work.

Abstract

The human visual recognition system shows an astonishing capability to compress visual information into a set of tokens containing rich representations without label supervision. One critical driving principle behind it is perceptual grouping. Although perceptual grouping was widely used in computer vision in the early 2010s, it remains a mystery whether it can be leveraged to derive a neural visual recognition backbone that generates equally powerful representations. In this paper, we propose the Perceptual Group Tokenizer, a model that relies entirely on grouping operations to extract visual features and perform self-supervised representation learning, where a series of grouping operations is used to iteratively hypothesize the context for pixels or superpixels to refine feature representations. We show that the proposed model achieves competitive performance compared to state-of-the-art vision architectures and inherits desirable properties, including adaptive computation without re-training and interpretability. Specifically, Perceptual Group Tokenizer achieves 80.3% on the ImageNet-1K self-supervised learning benchmark with linear probe evaluation, marking new progress under this paradigm.

Paper Structure (25 sections, 3 equations, 10 figures, 5 tables, 1 algorithm)


Figures (10)

  • Figure 1: Perceptual Group Tokenizer is entirely driven by grouping operations to perform representation learning. Group tokens (discovered objects) are shown above. See more in the appendix.
  • Figure 2: Perceptual Group Tokenizer takes in a sequence of patches (or pixels), generates high-dimensional embedding vectors for all patches, then passes them through a series of grouping layers to refine the embedding vectors as feature representations. Each grouping layer performs $K$ rounds of binding from input tokens to group tokens. To consider various grouping possibilities, multiple grouping heads are adopted. Each group token provides a useful context for input tokens for feature refinement. The final output of the model contains refined input tokens, group tokens, and assignments between input tokens and group tokens.
  • Figure 3: Operation comparison.
  • Figure 4: The entropy curves of grouping distributions $p(\boldsymbol{c})$ and $p(\boldsymbol{c}|\boldsymbol{x})$ across different layers.
  • Figure 5: Visualization of the attention maps of each group token across layers and grouping heads. $L$ indicates the layer index. Each grouping head has five group tokens. Smaller images show early layers, arranged as five group tokens per grouping head; large images show the last layer.
  • ...and 5 more figures
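
The Figure 2 caption describes each grouping layer as $K$ rounds of binding from input tokens to group tokens, and Figure 4 refers to the entropy of the grouping distributions $p(\boldsymbol{c})$ and $p(\boldsymbol{c}|\boldsymbol{x})$. The sketch below is one possible reading of those two ideas, not the authors' implementation. It assumes a single grouping head, a softmax over scaled dot products as the soft assignment, and an assignment-weighted mean as the group-token update; none of these choices is confirmed by the text above. The function names (`grouping_rounds`, `marginal_and_conditional_entropy`) and the toy sizes are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def grouping_rounds(inputs, groups, num_rounds):
    """Run `num_rounds` (>= 1) rounds of soft binding, single head.

    inputs: list of N input-token vectors (each a list of D floats)
    groups: list of M initial group-token vectors (each a list of D floats)
    Returns (groups, assign); assign[i][j] plays the role of p(c=j | x_i)
    under the assumptions stated in the lead-in.
    """
    dim = len(inputs[0])
    scale = 1.0 / math.sqrt(dim)
    assign = []
    for _ in range(num_rounds):
        # Assumed binding step: each input token spreads itself over groups.
        assign = [softmax([scale * dot(x, g) for g in groups]) for x in inputs]
        # Assumed update: each group token is the weighted mean of the inputs.
        new_groups = []
        for j in range(len(groups)):
            weight = sum(row[j] for row in assign)
            new_groups.append([
                sum(assign[i][j] * inputs[i][d] for i in range(len(inputs))) / weight
                for d in range(dim)
            ])
        groups = new_groups
    return groups, assign


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def marginal_and_conditional_entropy(assign):
    """Entropy of the marginal p(c) and the mean entropy of p(c | x)."""
    n, m = len(assign), len(assign[0])
    marginal = [sum(row[j] for row in assign) / n for j in range(m)]
    conditional = sum(entropy(row) for row in assign) / n
    return entropy(marginal), conditional


# Toy example: 16 input tokens, 4 group tokens, 8 dimensions, K = 3 rounds.
random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
init_groups = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
groups, assign = grouping_rounds(tokens, init_groups, num_rounds=3)
h_marginal, h_conditional = marginal_and_conditional_entropy(assign)
```

In this reading, a multi-head variant would run several such loops with separate group tokens, and the number of rounds or group tokens could be changed at inference time, which is one way to picture the adaptive computation mentioned in the abstract.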