DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification

Yujie Yang, Shuang Li, Jun Ye, Neng Dong, Fan Li, Huafeng Li

TL;DR

The paper tackles cross-modal video-based person re-identification across visible and infrared modalities, highlighting the underutilization of gait cues. It introduces DinoGRL, a two-branch framework that leverages DINOv2 priors to learn gait representations (SASGL) and progressively fuses gait with appearance (PBMGE) to form robust sequence embeddings. The optimization combines multi-granularity identity supervision with semantic priors, formalized as $L_{total} = L_{identity} + \lambda_1 L_{mask} + \lambda_2 L_{smo} + \lambda_3 L_{div}$. Experiments on HITSZ-VCM and BUPT show state-of-the-art results, validating the effectiveness of gait-guided representations for cross-modal video ReID and signaling potential for practical surveillance deployments across lighting conditions.
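As a rough illustration of how the weighted objective above combines its terms, the following minimal sketch sums four already-computed scalar losses. The names `LossWeights` and `total_loss` are hypothetical, the default weights are placeholders rather than values from the paper, and how each individual loss term is computed is not specified here.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    """Balancing coefficients for the auxiliary terms.

    The defaults are placeholders for illustration only; they are
    NOT the values used in the paper.
    """
    lambda_1: float = 1.0  # weight on L_mask
    lambda_2: float = 1.0  # weight on L_smo
    lambda_3: float = 1.0  # weight on L_div


def total_loss(l_identity: float, l_mask: float, l_smo: float,
               l_div: float, w: LossWeights) -> float:
    """Return L_identity + lambda_1*L_mask + lambda_2*L_smo + lambda_3*L_div."""
    return (l_identity
            + w.lambda_1 * l_mask
            + w.lambda_2 * l_smo
            + w.lambda_3 * l_div)


if __name__ == "__main__":
    weights = LossWeights(lambda_1=0.5, lambda_2=0.1, lambda_3=0.2)
    print(total_loss(1.0, 2.0, 3.0, 4.0, weights))
```

In a training loop the four inputs would typically be framework tensors rather than Python floats; the same arithmetic applies unchanged.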

Abstract

Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval. Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective to achieve semantically enriched and task-adaptive gait feature learning. Furthermore, we develop a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which progressively refines feature representations by enabling bidirectional interactions between gait and appearance streams across multiple spatial granularities, fully leveraging their complementarity to enhance global representations with rich local details and produce highly discriminative features. Extensive experiments on HITSZ-VCM and BUPT datasets demonstrate the superiority of our approach, significantly outperforming existing state-of-the-art methods.

Paper Structure

This paper contains 16 sections, 18 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Motivation of DinoGRL. (a) Existing shape-based VI-ReID methods often rely on image-level parsing networks, which are not optimized for ReID, particularly under the infrared modality, leading to noisy segmentation and neglect of temporal gait cues. (b) In contrast, DinoGRL leverages DINOv2 as a strong visual prior to produce high-quality silhouettes that facilitate the integration of sequence-level gait features, further refined by complementary appearance cues to achieve discriminative and modality-robust embeddings.
  • Figure 2: The overall framework of DinoGRL. This framework consists of two key modules: SASGL and PBMGE. SASGL employs a Semantic-Aware Silhouette Generator to produce modality-invariant silhouettes, leveraging the general-purpose visual priors of DINOv2 to facilitate gait representation learning. A Joint Learning Strategy is applied to simultaneously optimize silhouette generation and gait feature extraction, yielding gait representation $\mathbf{M}^t$. PBMGE further enhances global appearance and gait representations by integrating local features from the complementary stream across multiple granularities, yielding robust and discriminative pedestrian embeddings.
  • Figure 3: Illustration of the SASG, which produces and enriches silhouette representations with general-purpose semantic priors from DINOv2.
  • Figure 4: Rank-1 and mAP results for different values of $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ on the HITSZ-VCM dataset.
  • Figure 5: Pedestrian search results (Top-6 results; B/L: baseline; green: correct match; red: incorrect match).
  • ...and 1 more figure
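
Figure 4 reports Rank-1 and mAP. For readers unfamiliar with these retrieval metrics, the sketch below computes them under their standard definitions from per-query match flags. The function name `rank1_and_map` and the input layout are assumptions made for illustration; any dataset-specific evaluation protocol the paper follows (for example, camera or modality filtering of the gallery) is not reproduced here.

```python
from typing import List, Tuple


def rank1_and_map(ranked_matches: List[List[bool]]) -> Tuple[float, float]:
    """Compute Rank-1 accuracy and mean Average Precision (mAP).

    ranked_matches[i] holds one flag per gallery item for query i,
    ordered from most to least similar; True marks a gallery item
    with the same identity as the query.
    """
    rank1_hits = 0
    ap_sum = 0.0
    valid_queries = 0
    for flags in ranked_matches:
        n_pos = sum(flags)
        if n_pos == 0:
            continue  # a query with no true match cannot be scored
        valid_queries += 1
        if flags[0]:
            rank1_hits += 1
        hits = 0
        precision_sum = 0.0
        for rank, is_match in enumerate(flags, start=1):
            if is_match:
                hits += 1
                precision_sum += hits / rank  # precision at this rank
        ap_sum += precision_sum / n_pos
    if valid_queries == 0:
        raise ValueError("no query has a true match in the gallery")
    return rank1_hits / valid_queries, ap_sum / valid_queries


if __name__ == "__main__":
    toy = [[True, False, True], [False, True, False]]
    r1, m = rank1_and_map(toy)
    print(f"Rank-1 = {r1:.3f}, mAP = {m:.3f}")
```

Rank-1 only checks whether the top-ranked gallery item is correct, while mAP rewards placing every true match early, which is why the two can diverge for the same hyperparameter setting.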