AAformer: Auto-Aligned Transformer for Person Re-Identification

Kuan Zhu, Haiyun Guo, Shiliang Zhang, Yaowei Wang, Jing Liu, Jinqiao Wang, Ming Tang

TL;DR

This work introduces, for the first time, an alignment scheme in a transformer architecture and proposes the auto-aligned transformer (AAformer), which automatically locates both human parts and nonhuman ones at the patch level to aid person re-identification.

Abstract

In person re-identification (re-ID), extracting part-level features from person images has been shown to be crucial for providing fine-grained information. Most existing CNN-based methods only locate the human parts coarsely, or rely on pretrained human parsing models and fail to locate the identifiable nonhuman parts (e.g., knapsack). In this article, we introduce an alignment scheme in a transformer architecture for the first time and propose the auto-aligned transformer (AAformer) to automatically locate both the human parts and nonhuman ones at the patch level. We introduce the "Part tokens ([PART]s)", which are learnable vectors, to extract part features in the transformer. A [PART] only interacts with a local subset of patches in self-attention and learns to be the part representation. To adaptively group the image patches into different subsets, we design the auto-alignment. Auto-alignment employs a fast variant of the optimal transport (OT) algorithm to cluster the patch embeddings online into several groups with the [PART]s as their prototypes. AAformer integrates the part alignment into the self-attention, and the output [PART]s can be directly used as part features for retrieval. Extensive experiments validate the effectiveness of [PART]s and the superiority of AAformer over various state-of-the-art methods.
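The grouping step described in the abstract can be illustrated with a small sketch. The abstract only says that a fast variant of optimal transport is used; the sketch below assumes a Sinkhorn-Knopp style iteration with entropic regularization, which may differ from the authors' exact formulation. The function names `sinkhorn_assign` and `hard_groups` and the parameters `epsilon` and `n_iters` are hypothetical and chosen for illustration only.

```python
import math


def sinkhorn_assign(scores, epsilon=0.05, n_iters=3):
    """Softly assign N patches to P prototypes with roughly equal group sizes.

    scores[i][j] is the similarity between patch i and prototype j
    (assumed here to lie in a small range such as [0, 1]).
    Returns an N x P matrix whose rows each sum to 1.
    """
    n, p = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: give every prototype the same total mass (1 / P).
        col = [sum(q[i][j] for i in range(n)) for j in range(p)]
        q = [[q[i][j] / (col[j] * p) for j in range(p)] for i in range(n)]
        # Row step: give every patch the same total mass (1 / N).
        row = [sum(r) for r in q]
        q = [[q[i][j] / (row[i] * n) for j in range(p)] for i in range(n)]
    # Rescale so that each row is a distribution over prototypes.
    return [[v * n for v in r] for r in q]


def hard_groups(assignment):
    """Return, for each patch, the index of its highest-weight prototype."""
    return [max(range(len(r)), key=r.__getitem__) for r in assignment]


# Toy similarities: patches 0-1 resemble prototype 0, patches 2-3 prototype 1.
toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7], [0.1, 0.9]]
toy_assignment = sinkhorn_assign(toy_scores)
toy_groups = hard_groups(toy_assignment)
```

In this reading, the similarity scores would come from comparing patch embeddings with the [PART] prototypes, and the column step is what discourages all patches from collapsing onto a single prototype.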

Paper Structure

This paper contains 13 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: (a) The illustration of the added [PART]s. A [PART] only interacts with a subset of the patch embeddings and thus can learn to represent the subset. (b) [PART]s with PCB's partitioning. (c) [PART]s with SPReID's partitioning. (d) A toy example of the proposed Auto-Alignment mechanism. The image patches of the same part, which can be a human part or a non-human one, are adaptively grouped to the same [PART].
  • Figure 2: (a) The overview of AAformer. We divide the input image into fixed-size patches, linearly embed each of them and add the position embeddings. We add the extra learnable vectors of "Class token (CLS)" and "Part tokens ([PART]s)" to learn the global and part representations of person images. The [PART]s fed to the first Transformer layer are parameters of AAformer which learn the part prototypes of the datasets. The [PART]s output by Transformer layers are learned feature embeddings to represent human parts for input images. (b) Single-head Auto-Alignment. The self-attention for [PART]s is replaced by Auto-Alignment. $\mathbf{Q}^{\rm PT}$: query vectors of [PART]s. $\Phi_p$: the patches assigned to the $p$th [PART]. $\mathbf{Q, K, V}$: query, key, value vectors of patch embeddings.
  • Figure 3: Illustration of [PART]s in different layers of AAformer. The [PART]s fed to the first Transformer layer are the learnable parameters we add to the network. During training, they learn to be the part prototypes of the dataset and are therefore dataset-adaptive. The [PART]s output by the Transformer layers are the part representations of the input images and are instance-adaptive.
  • Figure 4: The ranking lists of TransReID and AAformer in misalignment scenes. Tiny clues are found by our AAformer.
  • Figure 5: The attention map of [PART] (PT). For each [PART], the patches not assigned to it are masked in black. The color range from blue to red indicates increasing attention.
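
The restricted attention described in the captions of Figures 2(b) and 5, where a [PART] attends only to the patches assigned to it, can be sketched as follows. This is a minimal single-head illustration assuming standard scaled dot-product attention; the captions do not give the exact formula, so the scaling and normalization here are assumptions. The function name `part_attention` and its arguments are hypothetical, with `members` standing in for the set $\Phi_p$.

```python
import math


def part_attention(q_part, keys, values, members):
    """Single-head attention of one [PART] query restricted to its patches.

    q_part: query vector of the [PART].
    keys, values: per-patch key and value vectors.
    members: indices of the patches assigned to this [PART].
    Returns the part feature and the attention weights over `members`.
    """
    if not members:
        raise ValueError("a [PART] needs at least one assigned patch")
    d = len(q_part)
    # Scaled dot-product logits, computed only for the assigned patches.
    logits = [
        sum(a * b for a, b in zip(q_part, keys[i])) / math.sqrt(d)
        for i in members
    ]
    # Softmax over the assigned patches; unassigned patches get no weight.
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim_v = len(values[0])
    out = [
        sum(w * values[i][k] for w, i in zip(weights, members))
        for k in range(dim_v)
    ]
    return out, weights


# Toy example: patch 2 is not assigned, so it cannot influence the output.
toy_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
part_feature, part_weights = part_attention([1.0, 0.0], toy_keys, toy_values, [0, 1])
single_feature, single_weights = part_attention([1.0, 0.0], toy_keys, toy_values, [0])
```

Restricting the softmax to the assigned patches has the same effect as masking the remaining patches before normalization, which matches the black regions shown in Figure 5.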