DIFFER: Disentangling Identity Features via Semantic Cues for Clothes-Changing Person Re-ID
Xin Liang, Yogesh S Rawat
TL;DR
DIFFER tackles clothes-changing person re-identification by disentangling biometric identity cues from non-biometric appearance factors using textual supervision. It leverages a visual encoder, semantic learning from VLM-generated biometric and non-biometric descriptions, and NBDetach with a gradient reversal layer to align image and text in disentangled subspaces. The approach achieves state-of-the-art results on four CC-ReID benchmarks, with clear top-1 gains on LTCC and PRCC and robust performance across datasets, while inference relies only on the visual stream. This framework demonstrates the practical value of training-time textual guidance for robust identity recognition under clothing variation, and it discusses limitations and ethical considerations for real-world deployment.
Abstract
Clothes-changing person re-identification (CC-ReID) aims to recognize individuals across changes of clothing. Current CC-ReID approaches fall into two groups. The first concentrates on modeling body shape using additional modalities such as silhouette, pose, and body mesh, which can cause the model to overlook other critical biometric traits such as gender, age, and style. The second incorporates supervision through additional labels, such as clothing or personal attributes, that the model learns to either disregard or emphasize. However, these annotations are discrete in nature and do not capture comprehensive descriptions. In this work, we propose DIFFER: Disentangle Identity Features From Entangled Representations, a novel adversarial learning method that leverages textual descriptions to disentangle identity features. Recognizing that image features inherently mix inseparable information, DIFFER introduces NBDetach, a feature disentanglement mechanism that uses the separable nature of text descriptions as supervision. It partitions the feature space into distinct subspaces and, through gradient reversal layers, separates identity-related features from non-biometric features. We evaluate DIFFER on four benchmark datasets (LTCC, PRCC, CelebReID-Light, and CCVID) and achieve state-of-the-art performance on all of them. DIFFER consistently outperforms the baseline method, with improvements in top-1 accuracy of 3.6% on LTCC, 3.4% on PRCC, 2.5% on CelebReID-Light, and 1% on CCVID. Our code can be found here.
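
The gradient reversal idea named in the abstract can be illustrated with a small, framework-free sketch. This is not DIFFER's implementation: the class `GradientReversal`, the helper `split_subspaces`, the strength `lam`, and the way the feature vector is split are hypothetical choices made only for illustration. The sketch shows the general mechanism: features pass through unchanged in the forward direction, while the gradient coming back from a non-biometric objective has its sign flipped before it reaches the encoder.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    lam: float = 1.0  # hypothetical reversal strength, not a value from the paper

    def forward(self, x: List[float]) -> List[float]:
        # Features pass through unchanged.
        return list(x)

    def backward(self, grad_out: List[float]) -> List[float]:
        # Flip (and scale) the gradient flowing back toward the encoder.
        return [-self.lam * g for g in grad_out]


def split_subspaces(feat: List[float], id_dim: int) -> Tuple[List[float], List[float]]:
    """Partition a feature vector into an identity part and a non-biometric part.

    The contiguous split is an assumption made for this sketch.
    """
    return feat[:id_dim], feat[id_dim:]


# Usage: reverse the gradient of a (made-up) non-biometric loss.
grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0, 0.7]
id_part, nb_part = split_subspaces(features, id_dim=2)

nb_forward = grl.forward(nb_part)            # unchanged non-biometric features
grad_from_nb_loss = [1.0, -2.0]              # illustrative gradient values
grad_to_encoder = grl.backward(grad_from_nb_loss)
```

In an autograd framework the same behavior would typically be written as a custom backward rule rather than an explicit `backward` method; the explicit version is used here only to make the sign flip visible.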
