TCFormer: Visual Recognition via Token Clustering Transformer

Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang

TL;DR

TCFormer introduces a dynamic-token vision transformer that forms tokens based on semantic meaning rather than a fixed grid. Using the Clustering-based Token Merge (CTM) module and the Multi-stage Token Aggregation (MTA) module, it builds a multi-scale token pyramid that preserves detail while focusing computation on informative regions; TCFormerV2 adds Local CTM and CR-MTA to boost efficiency and accuracy. The approach achieves strong results across image classification, human pose estimation, semantic segmentation, and object detection, often surpassing grid-based transformers at lower computational cost. The dynamic-token design improves the learning of object relationships and the capture of fine detail, which makes it practical for a range of vision tasks and a starting point for further dynamic-token architectures.

Abstract

Transformers are widely used in computer vision and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, a fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) they represent image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) they concentrate on regions with valuable details and represent them using fine tokens. Through extensive experiments across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of TCFormer. The code and models for this work are available at https://github.com/zengwang430521/TCFormer.

Paper Structure (24 sections, 4 equations, 16 figures, 9 tables)

Figures (16)

  • Figure 1: Comparison of different vision token distributions. Image regions with the same color are represented by the same vision token. Both prior isotropic (a) and pyramid (b) vision transformers treat all regions equally and disregard differences in semantic meaning. In contrast, our TCFormer (c) generates dynamic vision tokens with flexible shapes and sizes based on semantic meaning. For background regions, a single token (in blue) represents a large region, while informative regions are assigned more tokens (in green and red). For image details, tokens with fine spatial sizes are employed (in red).
  • Figure 2: Architecture of our Token Clustering Transformer (TCFormer). TCFormer adopts a widely utilized pyramid structure and consists of four stages. The vision tokens in the initial stage are generated from the pixels in a high-resolution feature map. Between consecutive stages, the Clustering-based Token Merge (CTM) module merges vision tokens to create dynamic tokens for the subsequent stage. The Multi-stage Token Aggregation (MTA) module integrates multi-scale token features in token format and outputs a token pyramid for further processing.
  • Figure 3: (a) Structure of the transformer blocks in TCFormerV1. Before the attention module, a token reduction layer is inserted to reduce the computational complexity. After the attention module, a depth-wise convolutional layer is included to extract local information. (b) The Spatial Token Reduction (SR) layer converts dynamic tokens into a feature map, which is subsequently compressed and flattened into key and value tokens.
  • Figure 4: Illustration of the dynamic vision token generation process. The Clustering-based Token Merge (CTM) module first groups the input tokens into several clusters and then merges the tokens in the same cluster into a single token via weighted feature averaging. After the CTM module, the merged tokens and the original tokens are input into a transformer block for better feature aggregation.
  • Figure 5: A typical example of the dynamic tokens produced by TCFormer. The input image is depicted in (a), and the dynamic tokens are presented in (b). The dynamic tokens can be converted into a high-resolution feature map (c), which retains the details but leads to large computational complexity, or a low-resolution feature map (d), which sacrifices the detailed information in the dynamic tokens.
  • ...and 11 more figures
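
The Figure 4 caption describes the CTM module as two steps: group the input tokens into clusters, then merge the tokens in each cluster into a single token via weighted feature averaging. The following is a minimal sketch of that idea, not the authors' implementation. The function names (assign_clusters, merge_tokens), the nearest-center grouping rule, and the toy weights are assumptions made for illustration; the captions above do not specify the clustering algorithm or how the weights are computed.

```python
from math import dist


def assign_clusters(tokens, center_ids):
    """Assign each token to its nearest center token.

    Assumption: nearest-center assignment is only a stand-in for the
    clustering step; the real module may cluster differently.
    """
    centers = [tokens[i] for i in center_ids]
    return [
        min(range(len(centers)), key=lambda c: dist(tok, centers[c]))
        for tok in tokens
    ]


def merge_tokens(tokens, weights, assignment, num_clusters):
    """Merge the tokens of each cluster by weighted feature averaging.

    Assumes positive weights and at least one token per cluster, so the
    per-cluster weight total is never zero.
    """
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    totals = [0.0] * num_clusters
    for tok, w, c in zip(tokens, weights, assignment):
        totals[c] += w
        for d in range(dim):
            sums[c][d] += w * tok[d]
    return [[s / totals[c] for s in sums[c]] for c in range(num_clusters)]


# Four toy 2-D token features; the weights are hypothetical importance scores.
tokens = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 14.0]]
weights = [1.0, 3.0, 1.0, 1.0]

assignment = assign_clusters(tokens, center_ids=[0, 2])
merged = merge_tokens(tokens, weights, assignment, num_clusters=2)

print(assignment)  # [0, 0, 1, 1]
print(merged)      # [[0.0, 1.5], [10.0, 12.0]]
```

In the first cluster the token with weight 3 pulls the merged feature toward itself (1.5 rather than the unweighted mean of 1.0), which is how weighted averaging lets more important tokens dominate the merged representation.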