Skeleton-Guided Spatial-Temporal Feature Learning for Video-Based Visible-Infrared Person Re-Identification

Wenjia Jiang; Xiaoke Zhu; Jiakang Gao; Di Liao

TL;DR

This work tackles video-based visible-infrared person re-identification (VVI-ReID), where spatial-temporal features degrade under modality gaps, low video quality, and occlusions. It introduces STAR, a two-level skeleton-guided framework: at the frame level, robust skeleton cues refine the visual features of individual frames; at the sequence level, a skeleton-based graph attention mechanism with GeM pooling aggregates information across frames. A skeleton consistency loss aligns the joint-based and edge-based skeleton representations. STAR reports state-of-the-art results on the HITSZ-VCM dataset, and ablations validate the contributions of both frame-level and sequence-level guidance. The findings suggest that skeleton-guided learning can improve cross-modality video person re-identification under challenging real-world conditions. The authors state that the code will be open-sourced.
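For readers unfamiliar with GeM (generalized-mean) pooling, the following is a minimal sketch of the standard formula applied across the frames of a sequence. It is not taken from the STAR code, which has not been released; the function name `gem_pool`, the default `p=3.0`, and the clamping constant `eps` are assumptions chosen for illustration.

```python
def gem_pool(frame_features, p=3.0, eps=1e-6):
    """Generalized-mean pooling over frames (illustrative, not the authors' code).

    frame_features: list of per-frame feature vectors of equal length.
    p: pooling exponent; p=1 gives average pooling, large p approaches max pooling.
    eps: hypothetical lower clamp so powers of non-positive values stay defined.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    pooled = []
    for d in range(dim):
        # Mean of the p-th powers of the clamped activations in this dimension.
        mean_pow = sum(max(frame[d], eps) ** p for frame in frame_features) / num_frames
        # The p-th root brings the result back to the original scale.
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


frames = [[1.0, 2.0], [3.0, 2.0]]
avg_like = gem_pool(frames, p=1.0)   # behaves like average pooling
max_like = gem_pool(frames, p=50.0)  # close to max pooling
```

With `p` treated as a learnable parameter, as is common in retrieval models, the network can interpolate between averaging all frames and emphasizing the strongest responses.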

Abstract

Video-based visible-infrared person re-identification (VVI-ReID) is challenging because of the large feature discrepancy between the two modalities. Spatial-temporal information in videos is crucial, but its accuracy is often degraded by low video quality and occlusions. Existing methods mainly focus on reducing modality differences and pay limited attention to improving spatial-temporal features, particularly for infrared videos. To address this, we propose a novel Skeleton-guided spatial-Temporal feAture leaRning (STAR) method for VVI-ReID. By using skeleton information, which is robust to poor image quality and occlusions, STAR improves the accuracy of spatial-temporal features in videos of both modalities. Specifically, STAR employs skeleton-guided strategies at two levels: the frame level and the sequence level. At the frame level, the robust structured skeleton information is used to refine the visual features of individual frames. At the sequence level, we design a feature aggregation mechanism based on a skeleton key-point graph, which learns the contribution of different body parts to the spatial-temporal features and further enhances the accuracy of the global features. Experiments on benchmark datasets demonstrate that STAR outperforms state-of-the-art methods. The code will be open-sourced soon.
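As a rough illustration of what "learning the contribution of different body parts" can mean in practice, the sketch below fuses per-part feature vectors with softmax-normalized weights. This is a generic attention-weighted sum, not the graph attention mechanism of STAR: in the paper the weights would come from a learned module over the skeleton key-point graph, whereas here the scores are simply passed in. The names `softmax`, `aggregate_parts`, and `part_scores` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_parts(part_features, part_scores):
    """Fuse per-body-part features into one vector (illustrative only).

    part_features: list of feature vectors, one per body part (equal length).
    part_scores: one raw relevance score per part; assumed here to be given,
                 whereas a real model would predict them.
    Returns the fused vector and the normalized weights.
    """
    if len(part_features) != len(part_scores):
        raise ValueError("one score is required per body part")
    weights = softmax(part_scores)
    dim = len(part_features[0])
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, part_features))
        for d in range(dim)
    ]
    return fused, weights


# Two hypothetical parts with equal scores contribute equally.
fused, weights = aggregate_parts([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])
```

Raising one part's score shifts the fused vector toward that part's features, which is the behavior a learned attention module would exploit to down-weight occluded or low-quality regions.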
