PartFormer: Awakening Latent Diverse Representation from Vision Transformer for Object Re-Identification

Lei Tan, Pingyang Dai, Jie Chen, Liujuan Cao, Yongjian Wu, Rongrong Ji

TL;DR

PartFormer is an adaptation of ViT designed to overcome the granularity limitations in object Re-ID tasks. It outperforms the state of the art by 2.4% mAP on the most challenging MSMT17 dataset.

Abstract

Extracting robust feature representations is critical for object re-identification, which must accurately identify objects across non-overlapping cameras. Despite its strong representation ability, the Vision Transformer (ViT) tends to overfit on the most distinctive regions of the training data, limiting its generalizability and its attention to holistic object features. Meanwhile, due to the structural difference between CNN and ViT, fine-grained strategies that effectively address this issue in CNN do not carry over to ViT. To address this issue, by observing the latent diverse representation hidden behind the multi-head attention, we present PartFormer, an innovative adaptation of ViT designed to overcome the granularity limitations in object Re-ID tasks. PartFormer integrates a Head Disentangling Block (HDB) that awakens the diverse representation of multi-head self-attention without the typical loss of feature richness induced by the concatenation and FFN layers that follow attention. To avoid the homogenization of attention heads and promote robust part-based feature learning, two head diversity constraints are imposed: an attention diversity constraint and a correlation diversity constraint. These constraints enable the model to exploit diverse and discriminative feature representations from different attention heads. Comprehensive experiments on various object Re-ID benchmarks demonstrate the superiority of PartFormer. Specifically, our framework significantly outperforms the state of the art by 2.4% mAP on the most challenging MSMT17 dataset.
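
To make the idea of the Head Disentangling Block more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each head's class-token output (row 0) is taken as that head's representation and that each head has its own projection matrix in place of the usual concatenation, shared projection and FFN. The function name, argument names and shapes are hypothetical, and normalization layers and biases are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_disentangling_block(tokens, w_q, w_k, w_v, head_proj):
    """Illustrative sketch only (assumed shapes and names).

    tokens:    (N, D) token features, row 0 assumed to be the class token
    w_q/k/v:   (D, D) query/key/value projections
    head_proj: (H, d, d_out) one unshared projection per head, with H * d == D
    Returns per-head features (H, d_out) and attention maps (H, N, N).
    """
    n_tokens, dim = tokens.shape
    n_heads, head_dim, _ = head_proj.shape
    assert n_heads * head_dim == dim

    def split(x):
        # (N, D) -> (H, N, d): one slice of the channels per head.
        return x.reshape(n_tokens, n_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))  # (H, N, N)
    per_head = attn @ v                                           # (H, N, d)

    # No concatenation, shared projection or FFN: each head keeps its own
    # output and is mapped by its own projection matrix.
    cls_per_head = per_head[:, 0, :]                              # (H, d)
    feats = np.einsum("hd,hdo->ho", cls_per_head, head_proj)      # (H, d_out)
    return feats, attn

rng = np.random.default_rng(0)
N, D, H, D_OUT = 5, 8, 4, 3
tokens = rng.normal(size=(N, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))
head_proj = rng.normal(size=(H, D // H, D_OUT))
feats, attn = head_disentangling_block(tokens, w_q, w_k, w_v, head_proj)
```

In this sketch `feats` holds one feature vector per head, so a downstream loss can treat the heads as separate part-level descriptors instead of a single fused embedding.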

Paper Structure (14 sections, 11 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: The motivation of the proposed PartFormer. We visualize the heads' attention in the final ViT block to explain the advantage of our method. In the vanilla ViT (top), we observe that multi-head attention contains a diverse focus on different parts of the object, but most of this diversity is lost after the subsequent layers. This yields an incomplete representation and limits re-identification performance. In our proposed PartFormer (bottom), we identify the cause of this degradation and awaken the latent diverse representation behind the multi-head attention.
  • Figure 2: (a) The architecture and training pipeline of PartFormer. PartFormer is built from the basic Transformer blocks inherited from the vanilla ViT, except that in the last block the head disentangling block replaces the original Transformer block to awaken the latent diverse representation hidden behind the multiple heads. (b) The architecture of the Head Disentangling Block (HDB). HDB aims to remove the fusion and selection processing in the Transformer block. In HDB, after the multi-head attention operation, each head's representation is output directly through an unshared linear projection. (c) Explanation of the correlation diversity constraint ($\mathcal{L}_{cdc}$). $\mathcal{L}_{cdc}$ treats the classifier output, after the score of the ground-truth class is removed, as a correlation distribution for each input image, and encourages different heads to produce different correlation distributions for the same input image. Here, we show an example of $\mathcal{L}_{cdc}$ after training on MSMT17. For the input image Input, Neg. 1 and Neg. 2 are the two hardest negative samples in $head_i$ and $head_j$, respectively. Because of $\mathcal{L}_{cdc}$, the similarity on the bike in $head_i$ prevents Input and Neg. 1 from remaining similar in $head_j$, thus pushing $head_j$ to focus on other content such as clothing.
  • Figure 3: Analysis of the parameters $\alpha$ and $\beta$ in Eq. (\ref{eq:loss}). The best performance is reached at $\alpha = 0.1$ and $\beta = 3.0$.
  • Figure 4: Visualization results of PartFormer on the MSMT17.
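
The correlation distribution described in the Figure 2(c) caption can be illustrated with a small NumPy sketch. The captions do not give the exact form of $\mathcal{L}_{cdc}$, so the penalty below (mean pairwise cosine similarity between the heads' correlation distributions) is only an assumed stand-in, and the function names are hypothetical.

```python
import numpy as np

def correlation_distribution(logits, label):
    """logits: (H, C) classifier scores per head; label: ground-truth class.

    Drops the ground-truth column and applies a softmax over the remaining
    classes, giving one distribution per head with shape (H, C - 1).
    """
    rest = np.delete(logits, label, axis=1)
    rest = rest - rest.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(rest)
    return e / e.sum(axis=1, keepdims=True)

def correlation_diversity_penalty(logits, label):
    """Assumed stand-in for the constraint, not the paper's formula:
    mean cosine similarity between different heads' distributions.
    Lower values mean the heads relate the input to different identities."""
    p = correlation_distribution(logits, label)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    sim = p @ p.T                                   # (H, H) cosine matrix
    off_diag = sim[~np.eye(sim.shape[0], dtype=bool)]
    return float(off_diag.mean())

# Two heads, four identities, ground truth is class 0.
same_heads = np.array([[5.0, 9.0, 0.0, 0.0],
                       [5.0, 9.0, 0.0, 0.0]])
diverse_heads = np.array([[5.0, 9.0, 0.0, 0.0],
                          [5.0, 0.0, 9.0, 0.0]])
penalty_same = correlation_diversity_penalty(same_heads, label=0)
penalty_diverse = correlation_diversity_penalty(diverse_heads, label=0)
```

In this toy case both heads of `same_heads` pick the same hardest negative, so the penalty is at its maximum, while the heads of `diverse_heads` pick different negatives and receive a much smaller penalty. This mirrors the Input, Neg. 1 and Neg. 2 example in the caption.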