AG-ReID.v2: Bridging Aerial and Ground Views for Person Re-identification

Huy Nguyen, Kien Nguyen, Sridha Sridharan, Clinton Fookes

TL;DR

This work tackles cross-view person re-identification by introducing AG-ReID.v2, a dataset of 100,502 images of 1,615 identities that combines aerial, CCTV, and wearable-camera imagery and is annotated with 15 soft attribute labels to capture cross-view variations. It proposes a three-stream architecture built on a Vision Transformer backbone: a transformer-based ReID stream, an elevated-view head-focused stream, and an explainable attribute-guided stream, integrated through a metric and attribute-aware loss framework. Key contributions include the expanded AG-ReID.v2 dataset, the Explainable Elevated-View Attention (EP+EVA) architecture, and experiments showing improvements over state-of-the-art baselines on aerial-ground ReID tasks. The authors plan to release the dataset and code publicly, at the repository linked in the abstract, to support research on cross-domain surveillance and attribute-guided ReID under realistic conditions.

Abstract

Aerial-ground person re-identification (Re-ID) presents unique challenges in computer vision, stemming from the distinct differences in viewpoints, poses, and resolutions between high-altitude aerial and ground-based cameras. Existing research predominantly focuses on ground-to-ground matching, with aerial matching less explored due to a dearth of comprehensive datasets. To address this, we introduce AG-ReID.v2, a dataset specifically designed for person Re-ID in mixed aerial and ground scenarios. This dataset comprises 100,502 images of 1,615 unique individuals, each annotated with matching IDs and 15 soft attribute labels. Data were collected from diverse perspectives using a UAV, stationary CCTV, and smart glasses-integrated camera, providing a rich variety of intra-identity variations. Additionally, we have developed an explainable attention network tailored for this dataset. This network features a three-stream architecture that efficiently processes pairwise image distances, emphasizes key top-down features, and adapts to variations in appearance due to altitude differences. Comparative evaluations demonstrate the superiority of our approach over existing baselines. We plan to release the dataset and algorithm source code publicly, aiming to advance research in this specialized field of computer vision. For access, please visit https://github.com/huynguyen792/AG-ReID.v2.

Paper Structure (28 sections, 17 equations, 11 figures, 7 tables)

Figures (11)

  • Figure 1: Aerial (a), CCTV (b), and wearable camera (c) perspectives vary in resolution, occlusion, and lighting in the AG-ReID.v2 dataset.
  • Figure 2: Data Collection Areas for the AG-ReID.v2 dataset.
  • Figure 3: 15 soft-biometric labels in the AG-ReID.v2 dataset.
  • Figure 4: Top 20 attribute distribution in our dataset.
  • Figure 5: Example images from two ground-ground datasets, Market-1501 (Zheng et al., 2015) and DukeMTMC-reID (Gou et al., 2017), alongside two aerial-aerial datasets, P-DESTRE (Kumar et al., 2021) and UAV-Human (Li et al., 2021), in comparison with our aerial-ground dataset, AG-ReID.v2. The images from AG-ReID.v2 highlight distinct challenges associated with reconciling perspective variances between ground-based (bottom row) and aerial-based (top row) images of individuals. This contrast is not as prevalent in the other datasets, which are confined to a single domain, either aerial or ground.
  • ...and 6 more figures