Dynamic Token Selection for Aerial-Ground Person Re-Identification

Yuhai Wang, Maryam Pishgar

TL;DR

The paper tackles cross-view Aerial-Ground Person Re-Identification (AGPReID), where appearance shifts between aerial and ground views and cluttered backgrounds hinder performance. It proposes the Dynamic Token Selective Transformer (DTST), which uses a View-Decoupled Transformer (VDT) with a Visual Token Selector to dynamically pick the Top-$K$ informative tokens (via scores $s_i$ and differentiable Top-$K$ relaxations) and an orthogonal loss to separate view-related from view-agnostic features. Empirical results on AG-ReID and especially the CARGO dataset show state-of-the-art performance: DTST delivers notable gains in mAP, Rank-1, and mINP over baselines and is robust to occlusion and complex backgrounds while reducing computation. The work advances AGPReID by combining targeted token-level feature selection with cross-view disentanglement, enabling more efficient and accurate identity representation across heterogeneous camera views.

Abstract

Aerial-Ground Person Re-identification (AGPReID) holds significant practical value but faces unique challenges due to pronounced variations in viewing angles, lighting conditions, and background interference. Traditional methods, which often analyze the entire image globally, tend to be inefficient and susceptible to irrelevant data. In this paper, we propose a novel Dynamic Token Selective Transformer (DTST) tailored for AGPReID, which dynamically selects pivotal tokens to concentrate on pertinent regions. Specifically, we segment the input image into multiple tokens, with each token representing a unique region or feature within the image. Using a Top-k strategy, we extract the k most significant tokens, which contain the information essential for identity recognition. Subsequently, an attention mechanism is employed to discern interrelations among the tokens, thereby enhancing the representation of identity features. Extensive experiments on benchmark datasets showcase the superiority of our method over existing works. Notably, on the CARGO dataset, our proposed method improves mAP by 1.18% over the second-best method. In addition, we comprehensively analyze the impact of different numbers of tokens, token insertion positions, and numbers of heads on model performance.
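
The Top-k step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_top_k_tokens` is hypothetical, the per-token scores are assumed to be given (the abstract does not say how they are computed), and the selection here is the hard, non-differentiable variant rather than a differentiable relaxation.

```python
from typing import List, Sequence


def select_top_k_tokens(tokens: Sequence[Sequence[float]],
                        scores: Sequence[float],
                        k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in original order.

    Hypothetical helper: `scores` stands in for whatever importance
    score the model assigns to each token.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(0, min(k, len(tokens)))
    # Rank indices by descending score; the stable sort keeps the
    # earlier token first when two scores tie.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Restore the original (spatial) order of the kept tokens.
    return sorted(ranked[:k])


# Four toy 2-dimensional tokens with assumed importance scores.
tokens = [[0.1, 0.2], [0.9, 0.4], [0.3, 0.3], [0.7, 0.8]]
scores = [0.05, 0.60, 0.10, 0.25]

kept = select_top_k_tokens(tokens, scores, k=2)
selected = [tokens[i] for i in kept]
```

With these toy scores the second and fourth tokens are kept, and only those would be passed on to the attention layers. Returning indices rather than copies keeps the sketch independent of how tokens are stored.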

Paper Structure

This paper contains 13 sections, 3 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of Aerial-Ground Person Re-identification (AGPReID). An aerial-ground mixed camera network enables matching across aerial-aerial, ground-ground, and aerial-ground scenarios, which makes the task both more challenging and more broadly applicable than traditional single-camera person ReID.
  • Figure 2: Illustration of the proposed Dynamic Token Selective Transformer (DTST) framework. The framework incorporates $N$ Token Selection view-decoupled transformer (VDT) blocks, where each block consists of an encoder layer and a visual token selector. The loss function is designed to account for both view-related and view-unrelated features, while an orthogonal loss ensures that these features remain independent from each other, further enhancing feature disentanglement and robustness.
  • Figure 3: Illustration of the Visual Token Selector (VTS). It selects the Top-$K$ informative tokens from the original token set for use in the subsequent feature aggregation.
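
The Figure 2 caption mentions an orthogonal loss that keeps view-related and view-unrelated features independent. The exact formulation is not given in this summary, so the sketch below assumes one common choice, the absolute cosine similarity between the two feature vectors, purely for illustration. The function name `orthogonality_penalty` and the `eps` parameter are hypothetical.

```python
import math
from typing import Sequence


def orthogonality_penalty(view_feat: Sequence[float],
                          id_feat: Sequence[float],
                          eps: float = 1e-12) -> float:
    """Absolute cosine similarity of two vectors (0.0 means orthogonal).

    Assumed stand-in for an orthogonal loss; the paper's own
    definition may differ.
    """
    if len(view_feat) != len(id_feat):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(view_feat, id_feat))
    norm_view = math.sqrt(sum(a * a for a in view_feat))
    norm_id = math.sqrt(sum(b * b for b in id_feat))
    # eps guards against division by zero for all-zero vectors.
    return abs(dot) / (norm_view * norm_id + eps)


# Orthogonal features incur no penalty; parallel ones the maximum.
no_overlap = orthogonality_penalty([1.0, 0.0], [0.0, 1.0])
full_overlap = orthogonality_penalty([1.0, 2.0], [2.0, 4.0])
```

Minimizing a penalty of this kind pushes the two feature vectors toward orthogonality, which is one way to encourage the disentanglement the caption describes.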