Dynamic Modality-Camera Invariant Clustering for Unsupervised Visible-Infrared Person Re-identification

Yiming Yang, Weipeng Hu, Haifeng Hu

TL;DR

DMIC tackles unsupervised VI-ReID by aligning clustering across both modality and camera using Modality-Camera Invariant Expansion (MIE), Dynamic Neighborhood Clustering (DNC), and Hybrid Modality Contrastive Learning (HMCL). MIE fuses inter-modal and inter-camera distance encodings to create modality-camera invariant embeddings; DNC dynamically tunes clustering objectives via $\epsilon$ and $k_2$ to balance discriminability and generalization; HMCL uses intra- and inter-modality memories with cluster-level and instance-level contrastive losses to refine distributions. Experiments on SYSU-MM01 and RegDB show competitive results, reducing the gap to supervised methods while remaining efficient. This approach provides a scalable and effective framework for unsupervised cross-modal, cross-camera person re-identification.

Abstract

Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) offers a more flexible and cost-effective alternative to supervised methods, and the field has gained increasing attention as a result. Existing methods simply cluster modality-specific samples and employ strong association techniques to achieve instance-to-cluster or cluster-to-cluster cross-modality associations. However, they ignore cross-camera differences, which causes identities to be split excessively across clusters and, in turn, undermines the accuracy and reliability of cross-modal associations. To address these issues, we propose a novel Dynamic Modality-Camera Invariant Clustering (DMIC) framework for USL-VI-ReID. Specifically, DMIC integrates Modality-Camera Invariant Expansion (MIE), Dynamic Neighborhood Clustering (DNC) and Hybrid Modality Contrastive Learning (HMCL) into a unified framework that eliminates both cross-modality and cross-camera discrepancies in clustering. MIE fuses inter-modal and inter-camera distance coding to bridge the gaps between modalities and cameras at the clustering level. DNC employs two dynamic search strategies to refine the network's optimization objective, transitioning from improving discriminability to enhancing cross-modal and cross-camera generalizability. Moreover, HMCL is designed to optimize instance-level and cluster-level distributions. Memories for intra-modality and inter-modality training are updated using randomly selected samples, facilitating real-time exploration of modality-invariant representations. Extensive experiments demonstrate that DMIC addresses the limitations of current clustering approaches and achieves competitive performance, significantly reducing the performance gap with supervised methods.
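To make the cluster-level part of HMCL more concrete, here is a minimal sketch of a memory-based contrastive loss. The abstract does not give the loss formula, so this assumes the InfoNCE-style form commonly used with cluster memories; the function name `cluster_nce_loss`, the `temperature` parameter, and its default value are illustrative assumptions rather than the authors' definitions.

```python
import math
from typing import Sequence


def cluster_nce_loss(
    query: Sequence[float],
    centroids: Sequence[Sequence[float]],
    positive: int,
    temperature: float = 0.05,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE-style loss of one query feature against a memory of cluster centroids.

    Assumption: similarity is a dot product (features are expected to be
    L2-normalised by the caller) and `positive` is the index of the cluster
    given by the query's pseudo label.
    """
    if not 0 <= positive < len(centroids):
        raise ValueError("positive must index an existing centroid")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Scaled similarity between the query and every centroid in memory.
    logits = [
        sum(q * c for q, c in zip(query, centroid)) / temperature
        for centroid in centroids
    ]

    # Numerically stable log-sum-exp over all clusters.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))

    # Negative log-probability of the positive cluster.
    return log_sum_exp - logits[positive]
```

The loss is small when the query is closest to the centroid of its own pseudo label and grows when another centroid is more similar. Under this sketch, an intra-modality memory would hold centroids from a single modality, while an inter-modality memory would hold centroids shared across visible and infrared samples.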

Paper Structure

This paper contains 17 sections, 20 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Illustration of cross-modality and cross-camera discrepancies in clustering. Large variations caused by these discrepancies lead to identity splitting. Fine-tuning the network with these inaccurate labels yields sub-optimal results.
  • Figure 2: Flowchart of the Dynamic Modality-Camera Invariant Clustering (DMIC) model. DMIC is composed of three key modules: Modality-Camera Invariant Expansion (MIE), Dynamic Neighborhood Clustering (DNC), and Hybrid Modality Contrastive Learning (HMCL). MIE fuses the distance encoding from multiple cameras to generate modality-camera invariant embeddings for clustering. DNC employs two dynamic search strategies that optimize the network's performance, transitioning from improving discriminability to enhancing generalization. The pseudo labels estimated by MIE and DNC are used to initialize instance-level and cluster-level memories. HMCL includes intra-modality and inter-modality contrastive learning to learn modality-camera invariant representations. At test time, the framework uses only the backbone.
  • Figure 3: Dynamic scheduler in DNC. (a) is the dynamic scheduler for $\epsilon$ in Eq. \ref{eq:intraclustering} and Eq. \ref{eq:interclustering}, while (b) is for $k_2$ in Eq. \ref{eq:intraclustering}.
  • Figure 4: Illustration of the clustering results of DNC. Taking $\epsilon$ as an example, we establish upper and lower dynamic range limits for $\epsilon$, denoted as $\pi_2$ and $\pi_1$, respectively. Initially, $\epsilon$ decreases from $\pi_2$ to $\pi_1$, excluding noisy instances from clusters. Subsequently, $\epsilon$ expands from $\pi_1$ to $\pi_2$, progressively incorporating cross-modality and cross-camera instances into clusters. In this manner, the model first improves discriminability and then gradually develops cross-modality and cross-camera generalizability.
  • Figure 5: Example images from visible-infrared pedestrian databases. The top row shows images in the visible modality, the second row images in the infrared modality, and the last row images in the CA modality.
  • ...and 3 more figures
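
The Figure 4 caption describes a two-phase schedule for the clustering radius: it first shrinks from $\pi_2$ to $\pi_1$, then grows back from $\pi_1$ to $\pi_2$. The sketch below shows one way such a schedule could look. The piecewise-linear shape, the `turn` parameter that sets where the phases switch, and the function name `dynamic_eps` are all assumptions made for illustration; the caption only fixes the direction of each phase and the two limits.

```python
def dynamic_eps(
    epoch: int,
    total_epochs: int,
    pi_1: float,
    pi_2: float,
    turn: float = 0.5,  # assumed fraction of training spent in the shrinking phase
) -> float:
    """Two-phase schedule for the clustering radius.

    Phase 1 (progress <= turn): shrink linearly from pi_2 down to pi_1,
    which tightens clusters and drops noisy instances.
    Phase 2 (progress > turn): grow linearly from pi_1 back up to pi_2,
    which gradually admits cross-modality and cross-camera instances.
    """
    if total_epochs < 2:
        raise ValueError("total_epochs must be at least 2")
    if not 0.0 < turn < 1.0:
        raise ValueError("turn must lie strictly between 0 and 1")
    if pi_1 > pi_2:
        raise ValueError("pi_1 is the lower limit and must not exceed pi_2")

    # Training progress in [0, 1]; epoch 0 maps to 0 and the last epoch to 1.
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)

    if progress <= turn:
        return pi_2 - (progress / turn) * (pi_2 - pi_1)
    return pi_1 + ((progress - turn) / (1.0 - turn)) * (pi_2 - pi_1)
```

For example, with hypothetical limits `pi_1=0.4` and `pi_2=0.6` over 11 epochs, the radius starts at the upper limit, reaches the lower limit at epoch 5, and returns to the upper limit at epoch 10. The value returned for each epoch would then be passed as the neighborhood radius to the clustering step for that epoch.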