Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification

Yingquan Wang; Pingping Zhang; Dong Wang; Huchuan Lu

TL;DR

GLTrans addresses the challenge of robust object Re-ID by integrating global and local features within a Vision Transformer framework. It introduces a Global Aggregation Encoder to fuse multi-layer class tokens into a comprehensive global descriptor and a Local Multi-layer Fusion pipeline to reweight and combine multi-layer patch tokens with global guidance for fine-grained local representations. The architecture includes Patch Token Fusion, Global-guided Multi-head Attention, and Part-based Transformer Layers, all trained with a combined cross-entropy and triplet loss, achieving strong results on Market1501, DukeMTMC-ReID, MSMT17, and VeRi-776. The findings demonstrate that jointly leveraging global semantics from multiple ViT layers and discriminative local cues from multi-layer patch tokens yields more robust and transferable Re-ID features.
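The combined training objective mentioned above can be illustrated with a minimal pure-Python sketch. It uses the standard definitions of softmax cross-entropy and hinge triplet loss. The function names, the margin of 0.3, the unweighted sum, and the use of a single pre-selected triplet are assumptions made for illustration; the summary does not specify how GLTrans weights the two terms or mines triplets.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample; `logits` is a list of floats."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge triplet loss: the positive should be closer than the negative by `margin`.

    The margin value is an assumed placeholder, not taken from the paper.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def combined_loss(logits, label, anchor, positive, negative, margin=0.3, weight=1.0):
    """Identity (cross-entropy) loss plus a weighted triplet loss.

    Assumption: a plain sum with weight=1.0; the source does not state the weighting.
    """
    return cross_entropy(logits, label) + weight * triplet_loss(
        anchor, positive, negative, margin
    )
```

In this sketch the cross-entropy term acts on identity logits, while the triplet term acts on feature embeddings, so the two terms supervise classification and metric structure respectively.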

Abstract

Object Re-Identification (Re-ID) aims to identify and retrieve specific objects from images captured at different places and times. Recently, object Re-ID has achieved great success with the advances of Vision Transformers (ViT). However, the effects of the global-local relation have not been fully explored in Transformers for object Re-ID. In this work, we first explore the influence of global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID. We find that the features from the last few layers of ViT already have a strong representational ability, and that the global and local information can mutually enhance each other. Based on this finding, we propose a Global Aggregation Encoder (GAE) to utilize the class tokens of the last few Transformer layers and learn comprehensive global features effectively. Meanwhile, we propose the Local Multi-layer Fusion (LMF), which leverages both the global cues from GAE and multi-layer patch tokens to explore discriminative local representations. Extensive experiments demonstrate that our proposed method achieves superior performance on four object Re-ID benchmarks.

Paper Structure (21 sections, 17 equations, 6 figures, 10 tables)

Introduction
Related Work
CNNs for Object Re-Identification
Transformers for Object Re-Identification
Proposed Method
Revisiting Vision Transformer
Global Aggregation Encoder
Local Multi-layer Fusion
Patch Token Fusion
Global-guided Multi-head Attention
Part-based Transformer Layers
Loss Functions
Experiments
Datasets and Evaluation Metrics
Implementation Details
...and 6 more sections

Figures (6)

Figure 1: Different structures employed in object Re-ID. (a) Part-based CNNs for local features. (b) Pure Transformers for global features. (c) Our proposed GLTrans method considers both local and global features.
Figure 2: Heatmap visualization of different ViT layers by Grad-CAM on MSMT17. Layer10, Layer11 and Layer12 denote the heatmaps of the 10th, 11th and 12th layers of ViT, respectively. Deeper red colors signify higher weights.
Figure 3: Our proposed GLTrans. The Vision Transformer (ViT) with side information embedding (cameras or viewpoints) is adopted as the backbone to obtain multi-layer class tokens and patch tokens. Then, two branches are used to extract global and local representations. The Global Aggregation Encoder (GAE) generates global representations by incorporating multi-layer class tokens, while Local Multi-layer Fusion (LMF) takes patch tokens as inputs to further extract the local-wise discriminative features.
Figure 4: Our proposed Patch Token Fusion (PTF) and Global-guided Multi-head Attention (GMA).
Figure 5: Visualization of the differences between ViT, PCB$^{*}$ and GLTrans by Grad-CAM. Deeper red colors signify higher weights. The first row shows the input images. The second, third and fourth rows show the activation maps produced by ViT, PCB$^{*}$ and our GLTrans, respectively.
...and 1 more figures
