Skeleton-Guided Spatial-Temporal Feature Learning for Video-Based Visible-Infrared Person Re-Identification

Wenjia Jiang, Xiaoke Zhu, Jiakang Gao, Di Liao

TL;DR

This work tackles video-based visible-infrared person re-identification (VVI-ReID) by addressing the fragility of spatial-temporal features under modality gaps and video quality issues. It introduces STAR, a two-level skeleton-guided framework that refines frame-level features using robust skeleton cues and aggregates sequence-level information through a skeleton-based graph attention mechanism with GeM pooling. The method includes a skeleton consistency loss to align joint- and edge-based skeleton representations and demonstrates state-of-the-art results on the HITSZ-VCM dataset, with clear ablations validating the contributions of frame- and sequence-level guidance. The findings highlight the practical potential of skeleton-guided learning to enhance cross-modality video person re-identification in challenging real-world conditions, with open-source code forthcoming.
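GeM (generalized-mean) pooling, mentioned above for sequence-level aggregation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gem_pool`, the default exponent `p = 3.0`, and the clamping constant `eps` are assumptions made for illustration, and the exponent is fixed here rather than learned. The sketch pools T frame-level feature vectors into one sequence-level vector as (mean of x^p)^(1/p).

```python
from typing import List, Sequence


def gem_pool(frames: Sequence[Sequence[float]], p: float = 3.0,
             eps: float = 1e-6) -> List[float]:
    """Generalized-mean (GeM) pooling over the time axis.

    frames: T feature vectors of equal dimension D, assumed non-negative
            (for example, post-ReLU activations).
    p:      pooling exponent (hypothetical default); p = 1 is average
            pooling, and large p approaches max pooling.
    eps:    lower clamp so that the power is well defined (assumption).
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    pooled = []
    for d in range(dim):
        # Mean of the p-th powers of channel d across all frames.
        mean_pow = sum(max(f[d], eps) ** p for f in frames) / num_frames
        # The p-th root brings the value back to the original feature scale.
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


if __name__ == "__main__":
    seq = [[1.0, 2.0], [3.0, 4.0]]
    print(gem_pool(seq, p=1.0))   # behaves like average pooling
    print(gem_pool(seq, p=50.0))  # close to the per-channel maximum
```

With `p = 1` the result equals the per-channel mean; raising `p` shifts the pooled value toward the strongest frame response, which is why GeM is often used as a tunable middle ground between average and max pooling.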

Abstract

Video-based visible-infrared person re-identification (VVI-ReID) is challenging because features from the two modalities differ significantly. Spatial-temporal information in videos is crucial, but its accuracy is often degraded by low video quality and occlusions. Existing methods mainly focus on reducing modality differences and pay limited attention to improving spatial-temporal features, particularly for infrared videos. To address this, we propose a novel Skeleton-guided spatial-Temporal feAture leaRning (STAR) method for VVI-ReID. By using skeleton information, which is robust to poor image quality and occlusions, STAR improves the accuracy of spatial-temporal features in videos of both modalities. Specifically, STAR employs skeleton-guided strategies at two levels: the frame level and the sequence level. At the frame level, the robust structured skeleton information is used to refine the visual features of individual frames. At the sequence level, we design a feature aggregation mechanism based on a skeleton key-point graph, which learns the contribution of different body parts to spatial-temporal features, further enhancing the accuracy of global features. Experiments on benchmark datasets demonstrate that STAR outperforms state-of-the-art methods. The code will be open-sourced soon.
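The idea of learning how much each body part contributes to the global feature can be sketched as a softmax-weighted sum over per-part features. This is only an illustration under stated assumptions: the abstract does not specify the graph-based aggregation in detail, so the names `softmax` and `aggregate_parts`, and the use of plain scalar scores in place of learned graph attention, are hypothetical stand-ins rather than the STAR mechanism itself.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_parts(part_feats: Sequence[Sequence[float]],
                    part_scores: Sequence[float]
                    ) -> Tuple[List[float], List[float]]:
    """Fuse per-body-part features into one global feature.

    part_feats:  one feature vector per body part (hypothetical input).
    part_scores: one unnormalized contribution score per part; in a real
                 model these would be produced by a learned module.
    Returns the fused feature and the normalized contribution weights.
    """
    if len(part_feats) != len(part_scores):
        raise ValueError("need exactly one score per body part")
    weights = softmax(part_scores)
    dim = len(part_feats[0])
    # Weighted sum over parts, computed independently for each channel.
    fused = [sum(w * f[d] for w, f in zip(weights, part_feats))
             for d in range(dim)]
    return fused, weights


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]
    print(aggregate_parts(feats, [0.0, 0.0]))  # equal contributions
    print(aggregate_parts(feats, [5.0, 0.0]))  # first part dominates
```

Parts with higher scores dominate the fused feature, so an occluded or low-quality part can be down-weighted without being discarded entirely.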

Paper Structure

This paper contains 19 sections, 12 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The insights of our method. The Skeleton-guided spatial-Temporal feAture leaRning (STAR) method operates at two levels: at the frame level, skeleton guidance corrects incomplete visual feature information; at the sequence level, skeleton-based guidance ensures global feature consistency.
  • Figure 2: Overview of the STAR method for VVI-ReID. The final output represents the identity probabilities of pedestrians in different modalities. At the frame level, skeleton connections refine individual frame features, and at the sequence level, a skeleton graph sequence aggregates features to enhance global accuracy. STAR improves spatial-temporal feature extraction in both visible and infrared modalities.
  • Figure 3: Body Part Contribution Aware Skeleton Guidance: Illustration of the spatial-temporal skeleton graph and the dynamic pooling mechanism that selectively integrates global features based on Body Part Contributions.
  • Figure 4: Heat maps visualized with Grad-CAM (Selvaraju et al., ICCV 2017) for the baseline and STAR, showing that STAR better highlights discriminative regions.
  • Figure 5: Comparison of mAP (%) at different sequence lengths for the baseline model and the STAR model. The shaded area represents the improvement margin achieved by the STAR model over the baseline. These results were obtained using the MindSpore framework.
  • ...and 1 more figure