Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification

Rifen Lin, Alex Jinpeng Wang, Jiawei Mo, Min Li

TL;DR

This work introduces CSIP-ReID, the first skeleton-driven pretraining framework for video-based person re-identification that performs genuine multimodal pretraining on ReID data by pairing skeleton sequences with aligned video frames. It comprises two stages: Stage 1 uses supervised contrastive losses to align skeleton and image features with a frozen CLIP visual encoder, and Stage 2 employs a Prototype Fusion Updater and Skeleton Guided Temporal Modeling to fuse motion and appearance while distilling temporal cues from skeleton data through Learning Using Privileged Information. The method yields state-of-the-art results on standard video ReID benchmarks (MARS, LS-VID, iLIDS-VID) and demonstrates strong generalization to skeleton-based ReID datasets (BIWI, IAS). Overall, CSIP-ReID establishes a motion-aware, annotation-free pretraining paradigm that broadens multimodal representation learning for ReID and related tasks.

Abstract

Multimodal pretraining has revolutionized visual understanding, but its impact on video-based person re-identification (ReID) remains underexplored. Existing approaches often rely on video-text pairs, yet suffer from two fundamental limitations: (1) they lack genuine multimodal pretraining, and (2) text poorly captures fine-grained temporal motion, an essential cue for distinguishing identities in video. In this work, we take a bold departure from text-based paradigms by introducing the first skeleton-driven pretraining framework for ReID. To achieve this, we propose Contrastive Skeleton-Image Pretraining for ReID (CSIP-ReID), a novel two-stage method that leverages skeleton sequences as a spatiotemporally informative modality aligned with video frames. In the first stage, we employ contrastive learning to align skeleton and visual features at the sequence level. In the second stage, we introduce a dynamic Prototype Fusion Updater (PFU) to refine multimodal identity prototypes, fusing motion and appearance cues. Moreover, we propose a Skeleton Guided Temporal Modeling (SGTM) module that distills temporal cues from skeleton data and integrates them into visual features. Extensive experiments demonstrate that CSIP-ReID achieves new state-of-the-art results on standard video ReID benchmarks (MARS, LS-VID, iLIDS-VID). Moreover, it exhibits strong generalization to skeleton-only ReID tasks (BIWI, IAS), significantly outperforming previous methods. CSIP-ReID pioneers an annotation-free and motion-aware pretraining paradigm for ReID, opening a new frontier in multimodal representation learning.
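
To make the first-stage idea concrete, the following is a minimal sketch of a symmetric, identity-supervised contrastive loss between sequence-level skeleton features and sequence-level image features. It is not the paper's implementation: the exact loss is not given in this section, so the function names, the temperature of 0.07, the choice of "same identity label within the batch" as the positive set, and the use of plain Python lists instead of tensors are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_logits(a_feats, b_feats, temperature):
    """Pairwise cosine similarities between two feature sets, divided by temperature."""
    a = [l2_normalize(v) for v in a_feats]
    b = [l2_normalize(v) for v in b_feats]
    return [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]


def directional_supcon(logits, labels):
    """One direction of the loss: each row is an anchor, columns are candidates.

    Positives for anchor i are all candidates that share its identity label.
    """
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in row))
        positives = [j for j, y in enumerate(labels) if y == labels[i]]
        total += -sum(row[j] - log_denom for j in positives) / len(positives)
    return total / len(logits)


def skeleton_image_supcon(skel_feats, img_feats, labels, temperature=0.07):
    """Sum of the skeleton-to-image and image-to-skeleton losses (hypothetical helper)."""
    s2i = cosine_logits(skel_feats, img_feats, temperature)
    i2s = [list(col) for col in zip(*s2i)]  # transpose gives the reverse direction
    return directional_supcon(s2i, labels) + directional_supcon(i2s, labels)


if __name__ == "__main__":
    # Toy batch: two identities, two sequences each, 2-D features.
    labels = [0, 0, 1, 1]
    skel = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
    img = [[1.0, 0.05], [0.9, 0.0], [0.0, 1.0], [0.05, 0.9]]
    print(skeleton_image_supcon(skel, img, labels))
```

In this sketch the loss is small when skeleton and image features of the same identity point in similar directions and grows when they are mismatched. In a real setup the two feature sets would come from a skeleton encoder and a visual encoder applied to paired sequences.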

Paper Structure

This paper contains 49 sections, 29 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: We propose the first contrastive skeleton-image pretraining for ReID. Comparison of CLIP-style learning frameworks: (a) CLIP training. (b) One-stage TF-CLIP training. (c) Two-stage CLIP-ReID training. (d) Contrastive learning in stage 1 of our two-stage CSIP-ReID training.
  • Figure 2: Illustration of the proposed CSIP-ReID framework. (I) Extract skeleton data using a human pose estimation model. (II) Stage 1: Contrastive Skeleton-Image Pretraining. (III) Stage 2: Prototype-guided Finetuning, consisting of Prototype Fusion Updater (PFU) and Skeleton Guided Temporal Modeling (SGTM). (IV) Prototype Fusion Updater (PFU) computes and fuses modality-specific prototypes, dynamically updating them with batch visual-skeleton features. (V) Skeleton Guided Temporal Modeling (SGTM) uses MTE to generate message tokens, employs ATD to distill skeleton temporal cues into visual features, and applies TA to aggregate these cues across tokens for frame-level representation.
  • Figure 3: Analysis of key modules/factors affecting performance. This figure illustrates (a) the impact of different temporal fusion methods, (b) the effect of the hyperparameter $\lambda_1$, and (c) the effect of $\lambda_2$ on model performance.
  • Figure 4: CSIP-ReID produces more compact, discriminative clusters than TF-CLIP in the t-SNE visualization. Each color represents a different identity. Red circles highlight samples from two visually similar identities.
  • Figure 5: CSIP-ReID shows stronger attention focus on tokens corresponding to human regions. We compare the proposed method CSIP-ReID with Baseline and TF-CLIP.
  • ...and 3 more figures