Clothes-Changing Person Re-identification Based On Skeleton Dynamics

Asaf Joseph, Shmuel Peleg

TL;DR

This work tackles Clothes-Changing ReID by discarding appearance information and relying solely on skeleton dynamics. It introduces a spatio-temporal Graph Convolutional Network that processes two parallel skeleton streams (joints and bones) to generate segment-level descriptors, trained with a triplet loss and cosine distance. At test time, it leverages Re-Ranking and Re-Voting across multiple video segments to boost accuracy, achieving state-of-the-art results on the CCVID dataset while preserving privacy. The approach demonstrates that skeletal motion cues provide robust, clothing-invariant identity signatures with practical implications for privacy-aware surveillance and re-identification tasks.
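The two ingredients named above, a bone stream derived from joints and a triplet loss over cosine distance, can be illustrated with a minimal sketch. It assumes that a bone is the vector from a parent joint to its child (a common convention that this summary does not confirm for the paper), and that the margin is 0.3; the function names, the edge list, and the margin are all hypothetical.

```python
import math

def bones_from_joints(joints, edges):
    """Derive bone vectors from 2D joints.

    joints: list of (x, y) coordinates for one frame.
    edges:  list of (parent, child) joint-index pairs (assumed skeleton layout).
    """
    return [(joints[c][0] - joints[p][0], joints[c][1] - joints[p][1])
            for p, c in edges]

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero descriptors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss; the margin value is an assumption."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative descriptor is farther from the anchor than the positive one by at least the margin, which is what pushes segments of the same person together in the latent space.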

Abstract

Clothes-Changing Person Re-Identification (ReID) aims to recognize the same individual across different videos captured at various times and locations. This task is particularly challenging due to changes in appearance, such as clothing, hairstyle, and accessories. We propose a Clothes-Changing ReID method that uses only skeleton data and does not use appearance features. Traditional ReID methods often depend on appearance features, leading to decreased accuracy when clothing changes. Our approach utilizes a spatio-temporal Graph Convolution Network (GCN) encoder to generate a skeleton-based descriptor for each individual. During testing, we improve accuracy by aggregating predictions from multiple segments of a video clip. Evaluated on the CCVID dataset with several different pose estimation models, our method achieves state-of-the-art performance, offering a robust and efficient solution for Clothes-Changing ReID.
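As a rough illustration of aggregating predictions from multiple segments, the sketch below splits a clip into fixed-length segments and takes a majority vote over per-segment identity predictions. The abstract does not specify the aggregation rule, so majority voting, the segment length, the stride, and both function names are assumptions made for this example only.

```python
from collections import Counter

def split_into_segments(num_frames, segment_len, stride):
    """Return (start, end) frame ranges of fixed-length segments (end exclusive)."""
    last_start = num_frames - segment_len
    return [(s, s + segment_len) for s in range(0, last_start + 1, stride)]

def vote_identity(segment_predictions):
    """Pick the identity predicted by the most segments.

    segment_predictions: one predicted identity label per segment.
    Ties are not handled specially in this sketch.
    """
    counts = Counter(segment_predictions)
    return counts.most_common(1)[0][0]
```

For example, a 10-frame clip with `segment_len=4` and `stride=3` yields three overlapping segments, and a clip whose segments are predicted as `["A", "B", "A"]` is assigned identity `"A"`.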

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The information in Skeletons: Each row shows the same person, where the three left and three right video frames were taken months apart. Despite changes in clothing and appearance, the hand movements remain consistent and identifiable. This consistent skeletal motion forms the basis of our method, enabling robust person re-identification despite changes in appearance.
  • Figure 2: Method Overview: An input video with $T$ frames is first processed by a pose estimation model, then fed into two streams of Graph Convolution Networks (GCNs) to extract features $f_j, f_b$ for the joints and bones streams, respectively. These features encode the video into a shared latent space. During training, the model is optimized using the triplet loss. During testing, two additional statistical methods, Re-Ranking and Re-Voting, are applied to extract the final ranked matching.
  • Figure 3: Schematic graph representation of skeleton sequences: Spatial edges connect joints such as $i^t$ and $j^t$ according to human body connectivity. Temporal edges link each joint $i$ between consecutive frames $t$ and $t+1$. Each node is also connected to itself.
  • Figure 4: The distribution of the number of frames per video clip across the entire CCVID dataset. The X-axis represents the number of frames in each video clip, while the Y-axis shows the number of videos corresponding to each frame count.
  • Figure 5: Re-Ranking Example: The top row shows the query image in a blue frame, together with its top 4 Nearest Neighbors (4-NN). Positive examples are in green frames, and negative examples are in black frames. Below each neighbor are its own 4-NN. We observe that the positive examples have the query among their own 4-NN. Consequently, K-NN neighbors that have the query among their own K-NN are re-ranked higher than neighbors that do not.
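
The reciprocal-neighbor idea in the Figure 5 caption can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: it assumes a precomputed square distance matrix over the query and gallery items, and it simply moves reciprocal neighbors ahead of non-reciprocal ones while preserving distance order within each group. The names `knn` and `rerank` are hypothetical.

```python
def knn(dist, i, k):
    """Indices of the k nearest neighbors of item i, excluding i itself."""
    others = [j for j in range(len(dist)) if j != i]
    return sorted(others, key=lambda j: dist[i][j])[:k]

def rerank(dist, query, k):
    """Re-rank the query's k-NN so that reciprocal neighbors come first.

    A neighbor j is reciprocal if the query is among j's own k-NN.
    Distance order is preserved inside each of the two groups.
    """
    neighbors = knn(dist, query, k)
    reciprocal = [j for j in neighbors if query in knn(dist, j, k)]
    non_reciprocal = [j for j in neighbors if j not in reciprocal]
    return reciprocal + non_reciprocal

# Toy 1D example: item 0 is the query, items 1-3 form a tight cluster.
positions = [0.0, 1.0, 1.1, 1.2, -1.05]
dist = [[abs(a - b) for b in positions] for a in positions]
ranked = rerank(dist, query=0, k=2)
```

In the toy example, item 1 is the closest to the query but its own nearest neighbors are its cluster mates, whereas item 4 has the query as a nearest neighbor, so item 4 is ranked ahead of item 1.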