SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks

Shining Wang, Yunlong Wang, Ruiqi Wu, Bingliang Jiao, Wenxuan Wang, Peng Wang

TL;DR

SeCap addresses the challenging cross-view AGPReID problem by introducing self-calibrating and adaptive prompts within an encoder–decoder transformer. The View Decoupling Transformer (VDT) decouples viewpoint information in the encoder, while the Prompt Re-calibration Module (PRM) and Local Feature Refinement Module (LFRM) in the decoder adapt prompts and refine local features to learn view-invariant representations. The approach is supported by two real-world datasets, LAGPeR and G2APS-ReID, and achieves state-of-the-art results across multiple AGPReID benchmarks, including strong cross-view performance and robustness to occlusion and viewpoint diversity. Overall, SeCap demonstrates effective cross-view alignment, improved local-feature discrimination, and practical impact for real-world aerial-ground surveillance scenarios, while contributing valuable datasets for the community.

Abstract

The main challenge in the Aerial-Ground Person Re-identification (AGPReID) task is the significant appearance variation caused by different viewpoints, which makes identity matching difficult. To address this issue, previous methods attempt to reduce the differences between viewpoints by using critical attributes and by decoupling the viewpoints. While these methods can mitigate viewpoint differences to some extent, they still face two main issues: (1) difficulty in handling viewpoint diversity and (2) neglect of the contribution of local features. To address these challenges, we design and implement the Self-Calibrating and Adaptive Prompt (SeCap) method for the AGPReID task. The core of this framework is the Prompt Re-calibration Module (PRM), which adaptively re-calibrates prompts based on the input. Combined with the Local Feature Refinement Module (LFRM), SeCap can extract view-invariant features from local features for AGPReID. Meanwhile, given the current scarcity of datasets in the AGPReID field, we further contribute two real-world Large-scale Aerial-Ground Person Re-Identification datasets, LAGPeR and G2APS-ReID. The former is collected and annotated by us independently, covering 4,231 unique identities and containing 63,841 high-quality images; the latter is reconstructed from the person search dataset G2APS. Through extensive experiments on AGPReID datasets, we demonstrate that SeCap is a feasible and effective solution for the AGPReID task. The datasets and source code are available at https://github.com/wangshining681/SeCap-AGPReID.

Paper Structure

This paper contains 32 sections, 7 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Aerial View and Ground View exhibit significant appearance variation due to notable differences in views. This variation poses substantial challenges for cross-view image matching.
  • Figure 2: (a) The architecture of the proposed SeCap. The key component is an encoder-decoder transformer. The encoder extracts the visual features of the picture and decouples the viewpoints. The decoder re-calibrates prompts through the current viewpoint information and decodes the local features using the re-calibrated prompts. (b) The Prompt Re-calibration Module (PRM) adaptively generates and re-calibrates prompts for different viewpoints according to view-invariant features. (c) The Local Feature Refinement Module (LFRM) finely decodes discriminative features from the local features using the re-calibrated prompts in PRM.
  • Figure 3: t-SNE visualization of the features extracted by SeCap and the baseline model. Circles ($\bullet$) represent the Aerial View, and pluses (+) represent the Ground View. Images with the same ID share the same color.
  • Figure 4: Comparison of several retrieval visualizations on the LAGPeR dataset under the $A\rightarrow G$ setting. Red and green boxes mark incorrect and correct matches, respectively. The top five results are listed.
  • Figure 5: The visualization results of the attention maps of our SeCap method and the baseline model.
  • ...and 4 more figures
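
As a rough illustration of the top-five retrieval shown in Figure 4, the sketch below ranks a small ground-view gallery against one aerial-view query by cosine similarity and flags which of the five best results share the query's identity. It is a generic ranking example, not the SeCap implementation: reading $A\rightarrow G$ as "aerial query, ground gallery" is an assumption, and the feature vectors, identity numbers, and function names (`cosine_similarity`, `rank_gallery`, `mark_matches`) are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(query_feat, gallery, top_k=5):
    """Return the top_k gallery entries as (person_id, similarity), best first."""
    scored = [(pid, cosine_similarity(query_feat, feat)) for pid, feat in gallery]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def mark_matches(query_id, ranked):
    """True where a ranked result has the query's identity (a green box), else False (red)."""
    return [pid == query_id for pid, _ in ranked]


# Toy data (made up): one aerial-view query and six ground-view gallery images.
aerial_query_id = 7
aerial_query_feat = [0.9, 0.1, 0.2]
ground_gallery = [
    (3, [0.1, 0.9, 0.1]),
    (7, [0.8, 0.2, 0.3]),
    (5, [0.2, 0.1, 0.9]),
    (7, [0.7, 0.1, 0.1]),
    (9, [0.0, 1.0, 0.0]),
    (2, [0.5, 0.5, 0.5]),
]

ranked = rank_gallery(aerial_query_feat, ground_gallery)
flags = mark_matches(aerial_query_id, ranked)

for (pid, sim), ok in zip(ranked, flags):
    print(f"id={pid} sim={sim:.3f} {'correct' if ok else 'wrong'}")
```

In practice the feature vectors would come from a trained re-identification model, and the same ranking would be repeated for every query to compute benchmark metrics.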