Text-based Aerial-Ground Person Retrieval

Xinyu Zhou, Yu Wu, Jiayao Ma, Wenhao Wang, Min Cao, Mang Ye

TL;DR

This paper introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), the task of retrieving person images across heterogeneous aerial and ground views from natural language queries. It contributes TAG-PEDES, a large cross-view dataset whose descriptions are produced with a Diversified Text Generation paradigm, and TAG-CLIP, a cross-modal framework that handles view heterogeneity with a Hierarchically-Routed Mixture of Experts (HR-MoE) module in the image encoder and a viewpoint decoupling strategy that aligns view-agnostic visual features with text. TAG-CLIP outperforms state-of-the-art methods on TAG-PEDES and remains competitive on traditional ground-view benchmarks, and ablations confirm the contribution of both HR-MoE and viewpoint decoupling. By combining view-aware image encoding, targeted cross-modal alignment, and automatically generated text annotation, the work advances practical cross-view person retrieval; the dataset and code are released to support further research.

Abstract

This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), which aims to retrieve person images from heterogeneous aerial and ground views using textual descriptions. Unlike traditional Text-based Person Retrieval (T-PR), which focuses solely on ground-view images, TAG-PR has greater practical significance and poses unique challenges because of the large viewpoint discrepancy across images. To support this task, we contribute: (1) the TAG-PEDES dataset, constructed from public benchmarks with automatically generated textual descriptions and enhanced by a diversified text generation paradigm to ensure robustness under view heterogeneity; and (2) TAG-CLIP, a novel retrieval framework that addresses view heterogeneity through a hierarchically-routed mixture of experts module, which learns view-specific and view-agnostic features, and a viewpoint decoupling strategy, which separates out view-specific features for better cross-modal alignment. We evaluate the effectiveness of TAG-CLIP on both the proposed TAG-PEDES dataset and existing T-PR benchmarks. The dataset and code are available at https://github.com/Flame-Chasers/TAG-PR.
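To make the retrieval setting concrete, the following minimal sketch ranks a mixed aerial and ground gallery against one text query by cosine similarity, the usual scoring rule for CLIP-style models. It is an illustration under stated assumptions, not the authors' implementation: the function names, image ids, and three-dimensional embeddings are all hypothetical, and in practice the vectors would come from the trained text and image encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_gallery(text_emb, gallery):
    """Return gallery image ids sorted by descending similarity to the query."""
    scored = [(cosine(text_emb, emb), img_id) for img_id, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [img_id for _, img_id in scored]

# Hypothetical toy embeddings; real ones would be encoder outputs.
gallery = {
    "ground_001": [0.9, 0.1, 0.0],  # ground-view image of the target
    "aerial_001": [0.8, 0.2, 0.1],  # aerial-view image of the same person
    "ground_002": [0.0, 1.0, 0.0],  # a different person
}
query = [1.0, 0.0, 0.0]  # embedding of the text description

ranking = rank_gallery(query, gallery)
print(ranking)
```

In this toy case both views of the target rank above the distractor, which is the behaviour a view-robust model is meant to achieve on real data.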

Paper Structure

This paper contains 32 sections, 14 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Illustration of TAG-PR. It aims to retrieve a target individual from an image gallery containing heterogeneous-view images, given a query text. The gallery includes ground-view images, typically captured by CCTV cameras at low altitudes (below 10 meters), and aerial-view images taken by UAVs from significantly higher altitudes ($\gg 10\,\mathrm{m}$).
  • Figure 2: Examples from the proposed TAG-PEDES dataset. More examples are provided in the Appendix.
  • Figure 3: (a) Results of quality assessment on TAG-PEDES. (b) The proportion of identities from different viewpoints.
  • Figure 4: Overview of TAG-CLIP. It comprises an image encoder and a text encoder. Several ViT blocks in the image encoder are augmented with the HR-MoE module, which employs hierarchical routers and expert groups for robust visual feature extraction. A viewpoint decoupling strategy, consisting of two loss functions, is used to decouple viewpoint information from global visual features, thereby improving cross-modal alignment.
  • Figure 5: t-SNE visualization of extracted visual features. Each color represents a unique identity.
  • ...and 5 more figures
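
The Figure 4 caption mentions hierarchical routers and expert groups but does not give the routing equations. The sketch below shows one plausible reading, offered only as an assumption: a top-level router softly weights expert groups, and a second router inside each group softly weights that group's experts. The function names, the use of dense soft routing (rather than, say, sparse top-k routing), and the toy weights are all hypothetical and may differ from the actual HR-MoE module.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def hierarchical_moe(x, group_router, expert_routers, expert_groups):
    """Two-level soft routing (assumed): weight groups, then experts per group.

    Each expert is a square matrix, so its output has the same size as x.
    """
    group_w = softmax(linear(group_router, x))
    out = [0.0] * len(x)
    for g, (router, experts) in enumerate(zip(expert_routers, expert_groups)):
        expert_w = softmax(linear(router, x))
        for e, expert in enumerate(experts):
            y = linear(expert, x)
            for i, yi in enumerate(y):
                out[i] += group_w[g] * expert_w[e] * yi
    return out

# Toy configuration: 2 groups with 2 experts each, all hypothetical.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]
group_router = [[0.5, -0.5], [-0.5, 0.5]]
expert_routers = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
expert_groups = [[identity, identity], [identity, identity]]

out = hierarchical_moe(x, group_router, expert_routers, expert_groups)
print(out)
```

Because the group weights and the per-group expert weights each sum to one, the combined routing weights also sum to one; with identity experts the output therefore reproduces the input, which is a convenient sanity check for the routing logic.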