DIFFER: Disentangling Identity Features via Semantic Cues for Clothes-Changing Person Re-ID

Xin Liang, Yogesh S Rawat

TL;DR

DIFFER tackles clothes-changing person re-identification by disentangling biometric identity cues from non-biometric appearance factors using textual supervision. It leverages a visual encoder, semantic learning from VLM-generated biometric and non-biometric descriptions, and NBDetach with a gradient reversal layer to align image and text in disentangled subspaces. The approach achieves state-of-the-art results on four CC-ReID benchmarks, with clear top-1 gains on LTCC and PRCC and robust performance across datasets, while inference relies only on the visual stream. This framework demonstrates the practical value of training-time textual guidance for robust identity recognition under clothing variation, and it discusses limitations and ethical considerations for real-world deployment.

Abstract

Clothes-changing person re-identification (CC-ReID) aims to recognize individuals under different clothing scenarios. Current CC-ReID approaches either concentrate on modeling body shape using additional modalities including silhouette, pose, and body mesh, potentially causing the model to overlook other critical biometric traits such as gender, age, and style, or they incorporate supervision through additional labels that the model tries to disregard or emphasize, such as clothing or personal attributes. However, these annotations are discrete in nature and do not capture comprehensive descriptions. In this work, we propose DIFFER: Disentangle Identity Features From Entangled Representations, a novel adversarial learning method that leverages textual descriptions to disentangle identity features. Recognizing that image features inherently mix inseparable information, DIFFER introduces NBDetach, a mechanism designed for feature disentanglement by leveraging the separable nature of text descriptions as supervision. It partitions the feature space into distinct subspaces and, through gradient reversal layers, effectively separates identity-related features from non-biometric features. We evaluate DIFFER on four benchmark datasets (LTCC, PRCC, CelebReID-Light, and CCVID) to demonstrate its effectiveness, and it achieves state-of-the-art performance on all of them. DIFFER consistently outperforms the baseline method, with improvements in top-1 accuracy of 3.6% on LTCC, 3.4% on PRCC, 2.5% on CelebReID-Light, and 1% on CCVID. Our code can be found here.
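
The gradient reversal idea mentioned in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes the reversal layer sits between the image feature and a non-biometric projection head, and it swaps in a squared-error loss as a stand-in for the contrastive loss so the gradients can be written by hand. The names `grl_forward`, `grl_backward`, `nonbiometric_grads`, and the scale `lam` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Compute W @ x for a row-major matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def mat_t_vec(W: Matrix, g: Vector) -> Vector:
    """Compute W^T @ g."""
    return [sum(W[i][j] * g[i] for i in range(len(W))) for j in range(len(W[0]))]


def grl_forward(features: Vector) -> Vector:
    """Gradient reversal, forward pass: the identity function."""
    return list(features)


def grl_backward(grad_output: Vector, lam: float = 1.0) -> Vector:
    """Gradient reversal, backward pass: flip the sign and scale by lam."""
    return [-lam * g for g in grad_output]


def nonbiometric_grads(
    f: Vector, W: Matrix, t: Vector, lam: float = 1.0
) -> Tuple[float, Matrix, Vector]:
    """Return (loss, grad wrt W, reversed grad wrt the image feature f).

    Assumption: loss = 0.5 * ||W f - t||^2 stands in for the
    non-biometric contrastive loss against a text feature t.
    """
    z = matvec(W, grl_forward(f))              # project into the subspace
    residual = [zi - ti for zi, ti in zip(z, t)]
    loss = 0.5 * sum(r * r for r in residual)
    # The projection head receives the ordinary gradient and learns to
    # match the non-biometric text feature.
    grad_W = [[r * fj for fj in f] for r in residual]
    # The encoder side receives the reversed gradient, so a descent step
    # on it pushes non-biometric information out of the image feature.
    grad_f = grl_backward(mat_t_vec(W, residual), lam)
    return loss, grad_W, grad_f


if __name__ == "__main__":
    loss, grad_W, grad_f = nonbiometric_grads(
        f=[1.0, 2.0], W=[[1.0, 0.0], [0.0, 1.0]], t=[0.0, 0.0], lam=0.5
    )
    print(loss, grad_W, grad_f)
```

In an autograd framework the same behaviour is usually written as a custom function whose forward pass returns its input and whose backward pass negates the incoming gradient. The hand-written gradients here only make the sign flip explicit.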

Paper Structure

This paper contains 29 sections, 6 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Motivation behind DIFFER. Encoded image features often entangle a multitude of information. By leveraging the separable nature of text descriptions, we guide the model to disentangle the feature space into distinct subspaces. This allows us to detach non-biometric factors and preserve the crucial biometric features.
  • Figure 2: Overview of DIFFER. First, we use a VLM with different task prompts to generate different textual captions from the perspective of biometric and non-biometric factors, which are encoded as textual features $f^t_{b}, f^t_{n_1},...,f^t_{n_N}$. Next, the image is encoded as an image feature $f^i$ in an entangled feature space, and a camera side information embedding (Camera SIE) is added during positional embedding. Finally, NBDetach is proposed to detach the non-biometric information and preserve the biometric information. Linear projection layers $W$ project the image feature into different subspaces. A GRL (Gradient Reversal Layer) is then added to disentangle the non-biometric information from the entire feature space, supervised by the non-biometric contrastive losses $\mathcal{L}_{\text{C}_{n}}$.
  • Figure 3: Clothing textual feature cluster visualization results. We visualize the cluster results of different images based on the clothing textual feature. As shown in the figure, images with similar outfits are successfully grouped together.
  • Figure 4: Top-10 retrieval results on the LTCC dataset for the baseline and DIFFER. Each row displays the ranked results for a query image (left), followed by the Rank-1 to Rank-10 retrieved images from left to right. The first row shows baseline results and the second shows DIFFER results, with correct matches in blue boxes and incorrect matches in red.
  • Figure :
  • ...and 3 more figures