SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

Yuhao Wang, Xiang Hu, Lixin Wang, Pingping Zhang, Huchuan Lu

TL;DR

SD-ReID tackles cross-view aerial-ground person re-identification by introducing a two-stage framework that jointly learns identity- and view-aware representations and then uses a Stable Diffusion model to generate view-specific features conditioned on identity and view cues. A memory bank stores global view prototypes to guide inference when instance-level view information is unavailable, while a View-Refined Decoder (VRD) aligns generated features with backbone representations to reduce distribution gaps. The method integrates a condition learner to fuse intermediate identity descriptors with global view cues, enabling robust cross-view generation. Across five AG-ReID benchmarks, SD-ReID achieves state-of-the-art performance, demonstrating the value of explicit view-specific feature generation for cross-view retrieval and its practical potential for real-world surveillance scenarios.

Abstract

Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative models that maintain identity consistency despite drastic changes in camera viewpoint. The core idea behind these methods is natural, but designing a view-robust model is very challenging. Moreover, they overlook the contribution of view-specific features in enhancing the model's ability to represent persons. To address these issues, we propose a novel generative framework named SD-ReID for AG-ReID, which leverages generative models to mimic the feature distribution of different views while extracting robust identity representations. More specifically, we first train a ViT-based model to extract person representations along with controllable conditions, including identity and view conditions. We then fine-tune the Stable Diffusion (SD) model to enhance person representations guided by these controllable conditions. Furthermore, we introduce the View-Refined Decoder (VRD) to bridge the gap between instance-level and global-level features. Finally, both person representations and all-view features are employed to retrieve target persons. Extensive experiments on five AG-ReID benchmarks (i.e., CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR and G2APS-ReID) demonstrate the effectiveness of our proposed method. The source code will be made available.
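
The final retrieval step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that the person representation and the view features are fused by plain concatenation and that gallery entries are ranked by cosine similarity, neither of which is specified in the abstract. The function names `l2_normalize`, `fuse`, and `rank_gallery` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(person_repr, view_feats):
    """Concatenate the person representation with each view feature.

    Concatenation is an assumed fusion rule, chosen only for illustration.
    """
    fused = list(person_repr)
    for feat in view_feats:
        fused.extend(feat)
    return l2_normalize(fused)


def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending cosine similarity.

    Inputs are expected to be unit-length, so the dot product equals
    the cosine similarity.
    """
    scores = [sum(q * g for q, g in zip(query, item)) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy usage: one query and two gallery entries with 2-D features.
query = fuse([1.0, 0.0], [[1.0, 0.0]])
gallery = [
    fuse([0.0, 1.0], [[0.0, 1.0]]),  # different direction from the query
    fuse([1.0, 0.0], [[1.0, 0.0]]),  # same direction as the query
]
ranking = rank_gallery(query, gallery)
```

In this toy setup the second gallery entry is ranked first because its fused descriptor points in the same direction as the query.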

Paper Structure

This paper contains 19 sections, 17 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 1: Motivations. (a) Previous AG-ReID methods focus on extracting view-shared features through cross-view feature alignment while discarding view-specific ones. (b) Our method leverages view-specific features with generative models conditioned on the opposite view to imitate counterparts.
  • Figure 2: Overall framework of the proposed SD-ReID. In the first stage, a view-aware Transformer encoder extracts person representations $I^L$ and instance-level view features $\tilde{P}_V$, while global view prototypes $M_A$ and $M_G$ are dynamically updated in the memory bank for stable cross-instance view conditions. In the second stage, Stable Diffusion (SD) is adopted to generate view-specific features. The condition learner integrates intermediate representations $\tilde{I}$ and global view prototypes, while the proposed View-Refined Decoder (VRD) injects instance-level view features at multiple scales. During inference, unavailable cross-view features are replaced by corresponding global prototypes retrieved from the memory bank. With the two stages, SD-ReID effectively enhances the discriminative ability of person representations for AG-ReID.
  • Figure 3: Details of the condition learner based on aerial input.
  • Figure 4: Inference process from aerial input to ground view feature generation.
  • Figure 5: mINP comparison with different layer numbers of the condition learner under both A→G and G→A protocols.
  • ...and 9 more figures
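
The memory bank described in the Figure 2 caption, and the inference-time fallback to global prototypes, can be sketched as follows. This is a hedged illustration rather than the paper's method: the caption only says the prototypes $M_A$ and $M_G$ are "dynamically updated", so the exponential-moving-average rule, the `momentum` parameter, and the names `ViewMemoryBank`, `update`, and `condition_for` are all assumptions.

```python
class ViewMemoryBank:
    """Keeps one running prototype per view, standing in for M_A and M_G."""

    def __init__(self, dim, momentum=0.9):
        self.momentum = momentum  # assumed hyperparameter, not from the source
        self.prototypes = {"aerial": [0.0] * dim, "ground": [0.0] * dim}
        self.seen = {"aerial": False, "ground": False}

    def update(self, view, feature):
        """Blend a new instance-level view feature into the view's prototype.

        The moving-average rule is an assumption made for this sketch.
        """
        if not self.seen[view]:
            # The first sample initializes the prototype directly.
            self.prototypes[view] = list(feature)
            self.seen[view] = True
            return
        m = self.momentum
        self.prototypes[view] = [
            m * p + (1.0 - m) * f
            for p, f in zip(self.prototypes[view], feature)
        ]

    def condition_for(self, target_view, instance_feature=None):
        """Pick the view condition for generation.

        Uses the instance-level feature when it exists (as in training) and
        falls back to the global prototype when it does not (as in inference).
        """
        if instance_feature is not None:
            return list(instance_feature)
        return list(self.prototypes[target_view])


# Toy usage: build a ground prototype, then query it for an aerial input
# that has no ground-view feature of its own.
bank = ViewMemoryBank(dim=2, momentum=0.5)
bank.update("ground", [2.0, 0.0])
bank.update("ground", [0.0, 2.0])
ground_condition = bank.condition_for("ground")
```

Keeping a single prototype per view makes the fallback independent of any particular identity, which matches the caption's description of the prototypes as stable cross-instance view conditions.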