Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification
Rifen Lin, Alex Jinpeng Wang, Jiawei Mo, Min Li
TL;DR
This work introduces CSIP-ReID, the first skeleton-driven pretraining framework for video-based person re-identification that performs genuine multimodal pretraining on ReID data by pairing skeleton sequences with aligned video frames. It comprises two stages: Stage 1 uses supervised contrastive losses to align skeleton and image features with a frozen CLIP visual encoder, and Stage 2 employs a Prototype Fusion Updater and Skeleton Guided Temporal Modeling to fuse motion and appearance while distilling temporal cues from skeleton data through Learning Using Privileged Information. The method yields state-of-the-art results on standard video ReID benchmarks (MARS, LS-VID, iLIDS-VID) and demonstrates strong generalization to skeleton-based ReID datasets (BIWI, IAS). Overall, CSIP-ReID establishes a motion-aware, annotation-free pretraining paradigm that broadens multimodal representation learning for ReID and related tasks.
Abstract
Multimodal pretraining has revolutionized visual understanding, but its impact on video-based person re-identification (ReID) remains underexplored. Existing approaches often rely on video-text pairs, yet suffer from two fundamental limitations: (1) they lack genuine multimodal pretraining, and (2) text poorly captures fine-grained temporal motion, an essential cue for distinguishing identities in video. In this work, we take a bold departure from text-based paradigms by introducing the first skeleton-driven pretraining framework for ReID. To achieve this, we propose Contrastive Skeleton-Image Pretraining for ReID (CSIP-ReID), a novel two-stage method that leverages skeleton sequences as a spatiotemporally informative modality aligned with video frames. In the first stage, we employ contrastive learning to align skeleton and visual features at the sequence level. In the second stage, we introduce a dynamic Prototype Fusion Updater (PFU) to refine multimodal identity prototypes, fusing motion and appearance cues. We further propose a Skeleton Guided Temporal Modeling (SGTM) module that distills temporal cues from skeleton data and integrates them into visual features. Extensive experiments demonstrate that CSIP-ReID achieves new state-of-the-art results on standard video ReID benchmarks (MARS, LS-VID, iLIDS-VID). It also exhibits strong generalization to skeleton-only ReID tasks (BIWI, IAS), significantly outperforming previous methods. CSIP-ReID pioneers an annotation-free and motion-aware pretraining paradigm for ReID, opening a new frontier in multimodal representation learning.
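To make the first-stage alignment more concrete, the following is a minimal NumPy sketch of a label-aware contrastive loss between sequence-level skeleton and visual embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `cross_modal_supcon`, the temperature of 0.07, the use of identity labels to define positives, and the symmetric averaging of the two directions are all hypothetical choices, and the embeddings stand in for the outputs of whatever skeleton and visual encoders are used.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_modal_supcon(skel, vis, labels, temperature=0.07):
    """Label-aware contrastive loss between two modalities (illustrative only).

    skel, vis: (B, D) sequence-level embeddings (hypothetical encoder outputs).
    labels:    (B,) integer identity labels; equal labels mark positive pairs.
    """
    skel = l2_normalize(skel)
    vis = l2_normalize(vis)
    logits = skel @ vis.T / temperature  # (B, B) scaled cosine similarities
    pos = (labels[:, None] == labels[None, :]).astype(np.float64)

    # Skeleton -> image: mean negative log-probability over each row's positives.
    s2i = -(pos * log_softmax(logits)).sum(axis=1) / pos.sum(axis=1)
    # Image -> skeleton: the same computation on the transposed similarities.
    i2s = -(pos.T * log_softmax(logits.T)).sum(axis=1) / pos.T.sum(axis=1)
    return float((s2i.mean() + i2s.mean()) / 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.array([0, 0, 1, 1, 2, 2])
    centers = rng.normal(size=(3, 16))  # one synthetic center per identity
    skel = centers[labels] + 0.01 * rng.normal(size=(6, 16))
    vis = centers[labels] + 0.01 * rng.normal(size=(6, 16))
    print("aligned loss:", cross_modal_supcon(skel, vis, labels))
    print("misaligned loss:", cross_modal_supcon(skel, vis[::-1], labels))
```

In this toy setup, embeddings of the same identity cluster around a shared center in both modalities, so the aligned loss should come out lower than the loss computed after reversing the order of the visual embeddings. A real pipeline would replace the synthetic arrays with encoder outputs and would typically compute the loss in an autodiff framework.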
