View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network

Quan Zhang; Lei Wang; Vishal M. Patel; Xiaohua Xie; Jianhuang Lai

TL;DR

Experiments show that VDT is a feasible and effective solution for AGPReID, surpassing the previous method in mAP/Rank1 by up to 5.0%/2.7% on CARGO and 3.7%/5.2% on AG-ReID while keeping computational complexity at the same order of magnitude.

Abstract

Existing person re-identification methods have achieved remarkable advances in appearance-based identity association across homogeneous cameras, such as ground-ground matching. However, aerial-ground person re-identification (AGPReID) among heterogeneous cameras, a more practical scenario, has received minimal attention. The most significant challenge in AGPReID is that dramatic view discrepancy disrupts discriminative identity representation. To alleviate this, we propose the view-decoupled transformer (VDT), a simple yet effective framework. VDT decouples view-related and view-unrelated features with two major components: hierarchical subtractive separation, which separates the two kinds of features inside the VDT, and an orthogonal loss, which constrains them to be independent. In addition, we contribute a large-scale AGPReID dataset called CARGO, consisting of five aerial and eight ground cameras, 5,000 identities, and 108,563 images. Experiments on two datasets show that VDT is a feasible and effective solution for AGPReID, surpassing the previous method in mAP/Rank1 by up to 5.0%/2.7% on CARGO and 3.7%/5.2% on AG-ReID while keeping computational complexity at the same order of magnitude. Our project is available at https://github.com/LinlyAC/VDT-AGPReID
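The two decoupling components named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the subtraction removes the view token from the meta token after every encoder layer, and that the orthogonal loss is the absolute cosine similarity between the two features. The names `subtract`, `orthogonal_loss`, `decouple`, and the `layer` callables are hypothetical stand-ins.

```python
import math

def subtract(meta, view):
    """Assumed form of the inner subtraction: remove the view-related
    feature from the meta (global) feature, element by element."""
    return [m - v for m, v in zip(meta, view)]

def orthogonal_loss(a, b, eps=1e-12):
    """Assumed form of the orthogonal loss: absolute cosine similarity.
    It is 0 when the two features are orthogonal and close to 1 when
    they are parallel."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return abs(dot) / (norm_a * norm_b + eps)

def decouple(meta, view, layers):
    """Apply each layer, then subtract the view token from the meta token.
    Each `layer` is a hypothetical callable standing in for a
    self-attention encoder layer; it maps (meta, view) to (meta, view)."""
    for layer in layers:
        meta, view = layer(meta, view)
        meta = subtract(meta, view)
    return meta, view

if __name__ == "__main__":
    identity = lambda m, v: (m, v)  # placeholder for a real encoder layer
    meta, view = decouple([3.0, 2.0], [1.0, 2.0], [identity, identity])
    print(meta, view, orthogonal_loss(meta, view))
```

In a training loop, a term like this would be added to the identity and view classification losses with a weighting factor; the exact formulation used by VDT is given in the paper's Method section.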

Paper Structure (24 sections, 10 equations, 7 figures, 5 tables)

Introduction
Related Work
View-homogeneous ReID
View-heterogeneous ReID
Synthetic ReID Dataset
Method
Formulation and Overview
View-decoupled Transformer
Optimization
Dataset: CARGO
Motivation
Description
Challenge
Experiment
Dataset and Metric
...and 9 more sections

Figures (7)

Figure 1: View-homogeneous vs. view-heterogeneous ReID, where the former focuses on ground-only or aerial-only camera networks, and the latter considers the aerial-ground mixed camera network. Thus, view-heterogeneous ReID considers aerial-aerial, ground-ground, and aerial-ground matching, which is more challenging and practical than the existing view-homogeneous ReID.
Figure 2: Illustration of the proposed VDT framework, which consists of $N$ VDT blocks and a three-part loss function. Meta and view tokens capture global and view-related features in images, respectively. Each VDT block (light blue module) consists of a standard self-attention encoder layer and an inner feature subtraction operation, achieving layer-by-layer decoupling of view-related and view-unrelated features. The orthogonal loss further constrains these two features to be independent.
Figure 3: Panels (a) to (c) show the pedestrian models, camera deployment, and challenges during the CARGO construction, respectively. In (b), "A-Cam" and "G-Cam" represent the aerial and ground cameras, where the yellow sectors represent the view range of the ground cameras, and the green arrows represent the motion strategy of the aerial cameras. The challenges displayed in (c) are view variation, illumination variation, occlusion, and resolution variation from top to bottom.
Figure 4: Panels (a) to (c) show the ablation experiments on the orthogonal decoupling of view-related and view-unrelated features in the VDT on two datasets. The decoupling consists of two important parts, the inner feature subtraction and the orthogonal loss. Rank1, mAP, and mINP are reported (%).
Figure 5: Panels (a) to (c) show the relationship between $\lambda$ and performance on two datasets. For simplicity, only protocol 1 is shown on the CARGO dataset. Rank1, mAP, and mINP are reported (%). Avg represents the average performance of Rank1, mAP, and mINP.
...and 2 more figures
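
The captions above report Rank1, mAP, and mINP. As a rough illustration of how these retrieval metrics are commonly computed, the sketch below scores each query from a ranked list of match flags. It is a simplified sketch rather than the paper's evaluation code: it assumes the gallery has already been sorted by descending similarity and filtered according to the protocol, and the function names `query_metrics` and `evaluate` are hypothetical.

```python
def query_metrics(ranked_matches):
    """Score one query.

    ranked_matches: list of booleans over the gallery, sorted by
    descending similarity; True where the gallery identity equals
    the query identity.
    Returns (rank1, average precision, inverse negative penalty).
    """
    hit_ranks = [i + 1 for i, hit in enumerate(ranked_matches) if hit]
    if not hit_ranks:
        raise ValueError("query has no correct match in the gallery")
    # Rank1: is the top-ranked gallery item a correct match?
    rank1 = 1.0 if ranked_matches[0] else 0.0
    # AP: mean of the precision measured at each correct match.
    ap = sum((k + 1) / r for k, r in enumerate(hit_ranks)) / len(hit_ranks)
    # INP: number of correct matches over the rank of the hardest one.
    inp = len(hit_ranks) / hit_ranks[-1]
    return rank1, ap, inp

def evaluate(all_ranked_matches):
    """Average the per-query scores and report (Rank1, mAP, mINP) in %."""
    per_query = [query_metrics(r) for r in all_ranked_matches]
    n = len(per_query)
    return tuple(100.0 * sum(q[i] for q in per_query) / n for i in range(3))

if __name__ == "__main__":
    print(evaluate([[True], [False, True]]))
```

A query whose correct matches sit at ranks 1 and 3 has an AP of (1/1 + 2/3) / 2 and an INP of 2/3, which shows why mINP penalizes a single hard match more strongly than mAP does.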
