TCFormer: Visual Recognition via Token Clustering Transformer
Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang
TL;DR
TCFormer introduces a dynamic-token vision transformer that forms tokens based on semantic meaning rather than a fixed grid. Using the Clustering-based Token Merge (CTM) module and the Multi-stage Token Aggregation (MTA) module, it builds a multi-scale token pyramid that preserves detail while focusing computation on informative regions; TCFormerV2 adds Local CTM and CR-MTA to improve efficiency and accuracy. The approach achieves strong results across image classification, human pose estimation, semantic segmentation, and object detection, often surpassing grid-based transformers at lower computational cost. The dynamic-token design improves the learning of object relationships and the capture of fine detail across these tasks.
Abstract
Transformers are widely used in computer vision and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, a fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens. Through extensive experiments across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of TCFormer. The code and models for this work are available at https://github.com/zengwang430521/TCFormer.
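
The first characteristic above, one token for semantically similar regions even when they are not adjacent, can be illustrated with a minimal sketch. It is not the TCFormer implementation: the function name `merge_tokens`, the toy features, and the hand-written cluster labels are all assumptions made for illustration, and the merge here is a plain average over each cluster.

```python
import numpy as np

def merge_tokens(features, labels):
    """Average the features of all tokens that share a cluster label.

    Hypothetical helper for illustration only; assumes labels are
    integers 0..K-1 and that every label is used at least once.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    num_clusters = int(labels.max()) + 1
    merged = np.zeros((num_clusters, features.shape[1]))
    # Sum the features of the tokens in each cluster, then divide by cluster size.
    np.add.at(merged, labels, features)
    counts = np.bincount(labels, minlength=num_clusters)
    return merged / counts[:, None]

# A 2x2 grid of tokens, flattened row by row. Tokens 0 and 3 sit on a
# diagonal (not adjacent) but carry the same label, so they end up in
# one merged token.
features = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 4.0], [3.0, 0.0]])
labels = np.array([0, 1, 1, 0])
merged = merge_tokens(features, labels)
```

Because the grouping depends only on the labels and not on grid position, four grid tokens collapse into two dynamic tokens here. How the labels are produced (the clustering step itself) is left out of this sketch.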
