Extended Cross-Modality United Learning for Unsupervised Visible-Infrared Person Re-identification
Ruixing Wu, Yiming Yang, Jiakai He, Haifeng Hu
TL;DR
This work tackles unsupervised visible-infrared person re-identification by addressing cross-modality clustering challenges and inter-modality gaps. It proposes Extended Cross-Modality United Learning (ECUL), which fuses cluster-level and instance-level contrastive learning with cross-modal memory aggregation, and introduces two novel modules: Extended Modality-Camera Clustering (EMCC) and Two-Step Memory Updating (TSMem). EMCC refines clustering by enforcing camera- and modality-aware constraints and by using a two-tier neighborhood (k2 and k3) to filter negatives while fusing positives. TSMem updates memory in two stages to preserve diversity early in training and enhance generalization later. Experiments on SYSU-MM01 and RegDB show ECUL achieving state-of-the-art or competitive results among unsupervised methods and surpassing several supervised approaches, demonstrating annotation-free viability for cross-modality Re-ID with practical security implications.
Abstract
Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) aims to learn modality-invariant features from unlabeled cross-modality datasets and to reduce the inter-modality gap. However, existing methods either lack cross-modality clustering or rely too heavily on cluster-level association, which makes reliable modality-invariant feature learning difficult. To address this issue, we propose an Extended Cross-Modality United Learning (ECUL) framework, incorporating Extended Modality-Camera Clustering (EMCC) and Two-Step Memory Updating Strategy (TSMem) modules. Specifically, we design ECUL to naturally integrate intra-modality clustering, inter-modality clustering, and inter-modality instance selection, establishing compact and accurate cross-modality associations while reducing the introduction of noisy labels. Moreover, EMCC captures and filters neighborhood relationships by extending the encoding vector, which further promotes the learning of modality-invariant and camera-invariant knowledge at the clustering stage. Finally, TSMem provides accurate and generalized proxy points for contrastive learning by updating the memory in stages. Extensive experimental results on the SYSU-MM01 and RegDB datasets demonstrate that the proposed ECUL shows promising performance and even outperforms certain supervised methods.
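To make the idea of memory-based contrastive learning with a staged memory update more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the TSMem update rules, so everything here is an assumption made for illustration: the function names, the `switch_epoch`, `momentum`, and `temperature` parameters, and the choice of an instance-wise update in the first stage followed by a cluster-mean update in the second stage.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def update_memory(memory, feats, labels, epoch, switch_epoch=20, momentum=0.2):
    """Staged momentum update of per-cluster proxies (hypothetical rule).

    Stage 1 (epoch < switch_epoch): each instance updates its proxy in turn,
    so individual samples leave a visible trace in the memory.
    Stage 2: the proxy is updated once with the mean feature of the cluster,
    which gives a smoother proxy.
    """
    grouped = {}
    for feat, label in zip(feats, labels):
        grouped.setdefault(label, []).append(l2_normalize(feat))

    for label, group in grouped.items():
        if epoch < switch_epoch:
            targets = group
        else:
            dim = len(group[0])
            mean = [sum(g[i] for g in group) / len(group) for i in range(dim)]
            targets = [l2_normalize(mean)]
        for target in targets:
            mixed = [momentum * m + (1.0 - momentum) * t
                     for m, t in zip(memory[label], target)]
            memory[label] = l2_normalize(mixed)
    return memory


def cluster_contrastive_loss(feat, label, memory, temperature=0.05):
    """InfoNCE-style loss of one feature against all cluster proxies."""
    query = l2_normalize(feat)
    logits = [sum(q * p for q, p in zip(query, proxy)) / temperature
              for proxy in memory]
    top = max(logits)  # subtract the maximum for numerical stability
    log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_denominator - logits[label]


if __name__ == "__main__":
    memory = [[1.0, 0.0], [0.0, 1.0]]  # two toy cluster proxies
    feats = [[1.0, 0.0], [0.0, 1.0]]   # two toy features, both pseudo-labeled 0
    update_memory(memory, feats, [0, 0], epoch=0)
    print(cluster_contrastive_loss([1.0, 0.0], 0, memory))
```

In this sketch the same batch moves a proxy differently depending on the stage: the instance-wise stage follows the most recent samples, while the mean-based stage moves the proxy toward the cluster center. Whether this matches the behavior of TSMem would need to be checked against the method description in the paper.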
