LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification

Pingping Zhang, Xiang Hu, Yuhao Wang, Huchuan Lu

TL;DR

The paper tackles cross-view Aerial-Ground Person Re-Identification (AG-ReID) by using person attributes, which stay stable across viewpoints, as textual cues. It introduces LATex, a prompt-tuning framework that integrates an Attribute-aware Image Encoder (AIE), a Prompted Attribute Classifier Group (PACG), and a Coupled Prompt Template (CPT) to convert attribute information and view context into structured sentences processed by CLIP’s text encoder. This design makes explicit use of attribute-based textual knowledge, achieving strong performance across AG-ReID benchmarks while significantly reducing the number of trainable parameters compared to full fine-tuning. The results demonstrate robustness when attributes are missing, efficiency gains, and qualitative evidence of attribute-driven discrimination in cross-view person retrieval.

Abstract

As an important task in intelligent transportation systems, Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras with different viewpoints. Previous methods typically adopt deep learning-based models, focusing on extracting view-invariant features. However, they usually overlook the semantic information in person attributes. In addition, existing training strategies often rely on full fine-tuning of large-scale models, which significantly increases training costs. To address these issues, we propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge. Specifically, with the Contrastive Language-Image Pre-training (CLIP) model, we first propose an Attribute-aware Image Encoder (AIE) to extract both global semantic features and attribute-aware features from input images. Then, with these features, we propose a Prompted Attribute Classifier Group (PACG) to predict person attributes and obtain attribute representations. Finally, we design a Coupled Prompt Template (CPT) to transform attribute representations and view information into structured sentences. These sentences are processed by the text encoder of CLIP to generate more discriminative features. As a result, our framework can fully leverage attribute-based text knowledge to improve AG-ReID performance. Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of our proposed methods. The source code is available at https://github.com/kevinhu314/LATex.
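
To make the idea of a structured sentence concrete, the following is a minimal plain-text sketch of combining a view label with predicted attributes. It is not the authors' CPT: the abstract describes the CPT as operating on attribute representations, whereas this sketch substitutes literal strings. The function name `build_structured_sentence`, the sentence wording, and the attribute names are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


def build_structured_sentence(view: str, attributes: Dict[str, Optional[str]]) -> str:
    """Compose one sentence from a view label and predicted attributes.

    Hypothetical plain-text analogue of a prompt template; the wording
    below is an assumption, not the template used in the paper.
    """
    # Skip attributes whose prediction is missing (None or empty string).
    parts = [f"{name} {value}" for name, value in attributes.items() if value]
    description = ", ".join(parts) if parts else "unknown attributes"
    return f"A photo of a person from the {view} view with {description}."


# Illustrative attribute names and values (assumed, not taken from a dataset).
example = build_structured_sentence(
    "aerial",
    {"gender": "male", "upper-body color": "red", "backpack": None},
)
print(example)
```

In a CLIP-based pipeline, a sentence like this would then be tokenized and passed to the text encoder. Dropping missing predictions instead of emitting a placeholder value is a design choice made for this sketch only.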

Paper Structure

This paper contains 19 sections, 13 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: An example of a person captured from (a) the aerial view by a UAV and (c) the ground view by a CCTV camera, along with (b) the corresponding person attributes. Despite significant variations in the images caused by drastic viewpoint changes, person attributes remain consistent.
  • Figure 2: Illustration of the proposed LATex framework. The Attribute-aware Image Encoder (AIE) first extracts global semantic features and attribute-aware features. Then, the Prompted Attribute Classifier Group (PACG) generates person attribute predictions and obtains specific representations of the predicted attributes. Afterwards, the Coupled Prompt Template (CPT) transforms attribute representations and view information into structured sentences. Finally, the structured sentences are processed by the text encoder of CLIP to generate discriminative features for person ReID, which are integrated with the global semantic features.
  • Figure 3: Performance with different numbers of prompts under two protocols.
  • Figure 4: Accuracy of attribute predictions in PACG.
  • Figure 5: The retrieval results using attribute features. Query images are marked with a yellow box. The corresponding attribute names and ground truths are displayed in blue and green boxes.
  • ...and 2 more figures