Dynamic Modality-Camera Invariant Clustering for Unsupervised Visible-Infrared Person Re-identification
Yiming Yang, Weipeng Hu, Haifeng Hu
TL;DR
DMIC tackles unsupervised VI-ReID by aligning clustering across both modality and camera using Modality-Camera Invariant Expansion (MIE), Dynamic Neighborhood Clustering (DNC), and Hybrid Modality Contrastive Learning (HMCL). MIE fuses inter-modal and inter-camera distance encodings to create modality-camera invariant embeddings; DNC dynamically tunes clustering objectives via $\epsilon$ and $k_2$ to balance discriminability and generalization; HMCL uses intra- and inter-modality memories with cluster-level and instance-level contrastive losses to refine distributions. Experiments on SYSU-MM01 and RegDB show competitive results, reducing the gap to supervised methods while remaining efficient. This approach provides a scalable and effective framework for unsupervised cross-modal, cross-camera person re-identification.
Abstract
Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) offers a more flexible and cost-effective alternative to supervised methods, and it has attracted increasing attention. Existing methods simply cluster modality-specific samples and employ strong association techniques to achieve instance-to-cluster or cluster-to-cluster cross-modality associations. However, they ignore cross-camera differences, so identities are split excessively across clusters, which undermines the accuracy and reliability of cross-modal associations. To address these issues, we propose a novel Dynamic Modality-Camera Invariant Clustering (DMIC) framework for USL-VI-ReID. Specifically, DMIC integrates Modality-Camera Invariant Expansion (MIE), Dynamic Neighborhood Clustering (DNC) and Hybrid Modality Contrastive Learning (HMCL) into a unified framework that eliminates both the cross-modality and cross-camera discrepancies in clustering. MIE fuses inter-modal and inter-camera distance coding to bridge the gaps between modalities and cameras at the clustering level. DNC employs two dynamic search strategies to refine the network's optimization objective, transitioning from improving discriminability to enhancing cross-modal and cross-camera generalizability. Moreover, HMCL is designed to optimize instance-level and cluster-level distributions. Memories for intra-modality and inter-modality training are updated using randomly selected samples, facilitating real-time exploration of modality-invariant representations. Extensive experiments demonstrate that DMIC addresses the limitations of current clustering approaches and achieves competitive performance, significantly reducing the performance gap with supervised methods.
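The memory mechanism mentioned for HMCL can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single cluster memory (the abstract describes separate intra-modality and inter-modality memories), an InfoNCE-style cluster-level loss, and a momentum update; the function names, the temperature of 0.05, and the momentum of 0.2 are all hypothetical choices made for illustration. Only the idea of refreshing each memory slot from one randomly selected sample comes from the text above.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cluster_contrastive_loss(feature, label, memory, temperature=0.05):
    """Cross-entropy over similarities between one feature and every memory slot.

    `temperature` is an assumed hyperparameter, not a value from the paper.
    """
    q = l2_normalize(feature)
    logits = [dot(q, c) / temperature for c in memory]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def update_memory(memory, features, labels, rng, momentum=0.2):
    """Refresh each cluster slot seen in the batch from ONE randomly chosen sample.

    The momentum blend is an assumption; the abstract only states that
    memories are updated using randomly selected samples.
    """
    by_cluster = {}
    for f, y in zip(features, labels):
        by_cluster.setdefault(y, []).append(f)
    for y, members in by_cluster.items():
        picked = l2_normalize(rng.choice(members))
        mixed = [momentum * old + (1 - momentum) * new
                 for old, new in zip(memory[y], picked)]
        memory[y] = l2_normalize(mixed)
    return memory


# Toy usage: two clusters in a 2-D feature space.
rng = random.Random(0)
memory = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 1.0])]
batch = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.95]]
labels = [0, 0, 1]

loss_match = cluster_contrastive_loss(batch[0], 0, memory)
loss_mismatch = cluster_contrastive_loss(batch[0], 1, memory)
memory = update_memory(memory, batch, labels, rng)
```

In this toy setting the loss is small when a feature is paired with its own cluster slot and large when paired with the other one, which is the behavior a cluster-level contrastive objective relies on. Sampling one random member per cluster, rather than averaging all members, keeps each slot tied to a recently seen instance.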
