Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification
Yingquan Wang, Pingping Zhang, Dong Wang, Huchuan Lu
TL;DR
GLTrans addresses robust object Re-Identification (Re-ID) by integrating global and local features within a Vision Transformer (ViT) framework. It introduces a Global Aggregation Encoder that fuses class tokens from multiple layers into a comprehensive global descriptor, and a Local Multi-layer Fusion pipeline that reweights and combines multi-layer patch tokens under global guidance to obtain fine-grained local representations. The architecture includes Patch Token Fusion, Global-guided Multi-head Attention, and Part-based Transformer Layers, all trained with a combined cross-entropy and triplet loss. It achieves strong results on Market1501, DukeMTMC-ReID, MSMT17, and VeRi-776. The findings indicate that jointly leveraging global semantics from multiple ViT layers and discriminative local cues from multi-layer patch tokens yields more robust and transferable Re-ID features.
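The combined cross-entropy and triplet objective mentioned above can be illustrated with a minimal single-sample sketch. It is not the authors' implementation. The margin of 0.3, the equal weighting of the two terms, the Euclidean distance, and all function names are assumptions made for illustration; this summary does not specify them.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using log-sum-exp for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge triplet loss: the positive should be closer than the negative by `margin`.

    margin=0.3 is an assumed value, not taken from the paper.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def combined_loss(logits, label, anchor, positive, negative, margin=0.3, weight=1.0):
    """Identity classification loss plus a weighted metric loss (weight is assumed)."""
    return cross_entropy(logits, label) + weight * triplet_loss(
        anchor, positive, negative, margin
    )
```

In practice the triplet term is usually computed over mined triplets within a batch; the single-triplet form here only shows how the two terms are added.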
Abstract
Object Re-Identification (Re-ID) aims to identify and retrieve specific objects from images captured at different places and times. Recently, object Re-ID has achieved great success with the advances of Vision Transformers (ViT). However, the effects of the global-local relation have not been fully explored in Transformers for object Re-ID. In this work, we first explore the influence of the global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID. We find that the features from the last few layers of ViT already have strong representational ability, and that global and local information can mutually enhance each other. Based on this finding, we propose a Global Aggregation Encoder (GAE) that utilizes the class tokens of the last few Transformer layers to learn comprehensive global features effectively. Meanwhile, we propose a Local Multi-layer Fusion (LMF) module, which leverages both the global cues from GAE and multi-layer patch tokens to explore discriminative local representations. Extensive experiments demonstrate that our proposed method achieves superior performance on four object Re-ID benchmarks.
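To make the idea of using class tokens from the last few Transformer layers concrete, the sketch below collects them and combines them with a plain average. The averaging is a deliberately simple stand-in: GAE is a learned encoder whose internals are not described in this abstract. The function name, the `num_layers` parameter, and the convention that token 0 of each layer is the class token are assumptions for illustration.

```python
def aggregate_class_tokens(layer_outputs, num_layers=4):
    """Average the class tokens of the last `num_layers` layers.

    layer_outputs: list over layers; each entry is a list of token vectors,
    with the class token assumed to sit at index 0.
    Averaging is a placeholder for the learned aggregation in GAE.
    """
    selected = [tokens[0] for tokens in layer_outputs[-num_layers:]]
    dim = len(selected[0])
    return [sum(tok[d] for tok in selected) / len(selected) for d in range(dim)]


# Three toy layers, each with one class token and one patch token.
layers = [
    [[1.0, 2.0], [9.0, 9.0]],
    [[3.0, 4.0], [9.0, 9.0]],
    [[5.0, 6.0], [9.0, 9.0]],
]
global_feature = aggregate_class_tokens(layers, num_layers=2)
```

Only the class tokens of the selected layers contribute to `global_feature`; the patch tokens are left for the local branch.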
