Mutual Information Guided Optimal Transport for Unsupervised Visible-Infrared Person Re-identification

Zhizhong Zhang, Jiangming Wang, Xin Tan, Yanyun Qu, Junping Wang, Yong Xie, Yuan Xie

TL;DR

This work tackles unsupervised visible-infrared person re-identification (USVI-ReID) by grounding cross-modal learning in mutual information (MI). It derives three guiding principles (Sharpness, Fairness, and Fitness) and implements a looped training regime that alternates model updates with cross-modality prototype matching via a uniform-prior optimal transport prototype assignment (OTPA). Prototype-based contrastive learning (PBCL) and cross prediction alignment learning (CPAL) exploit the cross-modality correspondence to minimize intra- and cross-modality entropy, reaching 60.6% and 90.3% Rank-1 accuracy on SYSU-MM01 and RegDB without labels. The approach improves over prior USVI-ReID methods and is competitive with supervised VI-ReID, with efficient computation and robustness to incomplete cross-modality overlap. The framework integrates MI theory, optimal transport (OT), and prototype-based representation learning.

Abstract

Unsupervised visible infrared person re-identification (USVI-ReID) is a challenging retrieval task that aims to retrieve cross-modality pedestrian images without using any label information. In this task, the large cross-modality variance makes it difficult to generate reliable cross-modality labels, and the lack of annotations also provides additional difficulties for learning modality-invariant features. In this paper, we first deduce an optimization objective for unsupervised VI-ReID based on the mutual information between the model's cross-modality input and output. With equivalent derivation, three learning principles, i.e., "Sharpness" (entropy minimization), "Fairness" (uniform label distribution), and "Fitness" (reliable cross-modality matching) are obtained. Under their guidance, we design a loop iterative training strategy alternating between model training and cross-modality matching. In the matching stage, a uniform prior guided optimal transport assignment ("Fitness", "Fairness") is proposed to select matched visible and infrared prototypes. In the training stage, we utilize this matching information to introduce prototype-based contrastive learning for minimizing the intra- and cross-modality entropy ("Sharpness"). Extensive experimental results on benchmarks demonstrate the effectiveness of our method, e.g., 60.6% and 90.3% of Rank-1 accuracy on SYSU-MM01 and RegDB without any annotations.
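
The "Sharpness" and "Fairness" principles named in the abstract can be read off the textbook entropy decomposition of mutual information. The identity below is a sketch of that standard result, not the paper's exact objective; $\boldsymbol{x}$ is the model input, $\boldsymbol{y}$ the predicted label, and $K$ (a symbol assumed here) is the number of classes.

$$
I(\boldsymbol{x};\boldsymbol{y}) \;=\; H(\boldsymbol{y}) \;-\; H(\boldsymbol{y}\mid\boldsymbol{x}), \qquad 0 \le H(\boldsymbol{y}) \le \log K .
$$

Maximizing $I(\boldsymbol{x};\boldsymbol{y})$ therefore pushes the conditional entropy $H(\boldsymbol{y}\mid\boldsymbol{x})$ down, which means confident per-sample predictions ("Sharpness"), and pushes the marginal entropy $H(\boldsymbol{y})$ up toward $\log K$, which is attained only by the uniform label distribution ("Fairness"). "Fitness" concerns matched cross-modality pairs and is not captured by this single-modality identity.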

Paper Structure

This paper contains 41 sections, 39 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: (a) One way to realize unsupervised learning is to maximize the mutual information between the model's input $\boldsymbol{x}$ and output $\boldsymbol{y}$ (ReMixMatch), i.e., "Sharpness" and "Fairness". (b) For the unsupervised VI-ReID task, we additionally mine the modality-consistent mutual information, i.e., "Fitness": we maximize the mutual information between the matched cross-modality input pairs $\boldsymbol{x}^{v},\boldsymbol{x}^{r}$ and the corresponding outputs $\boldsymbol{y}^{v},\boldsymbol{y}^{r}$, where $v$ and $r$ denote the visible and infrared modalities, respectively.
  • Figure 2: The pipeline of our framework. The Discrepancy Elimination Network (DEN) extracts robust features for unlabeled data. We then use DBSCAN to generate clustering pseudo labels for each modality and create visible and infrared prototype memories. The Optimal Transport Prototype Assignment (OTPA) is then proposed to find the cross-modality correspondence. This correspondence helps us obtain the cross-modality pseudo labels and create the cross prototype memory. Based on these memories, we design several Prototype-Based Contrastive Learning (PBCL) losses, including VCL, ICL, CCL, and MCL, to minimize intra- and cross-modality entropy. We also use Cross Prediction Alignment Learning (CPAL), following OTLA-ReID, to reduce the negative effects of inaccurately matched cross-modality data.
  • Figure 3: Cross-Modality Pseudo Label Generation. Degraded Solution: without the uniform prior, the assignment collapses to a degenerate solution. Uniform Prior: OTPA benefits from the balanced assignment induced by the uniform prior, but some demand prototypes remain unassigned, e.g., node $10$. Opposite Transport: we conduct a transport in the opposite direction and find the matched nodes for the unassigned demand prototypes, e.g., node $3$. Matching: the final cross-modality correspondence.
  • Figure 4: Diagram of the prototype-based contrastive learning losses, including VCL, ICL, CCL, and MCL. These losses serve as substitutes for entropy minimization, since no proper classifier is available.
  • Figure 5: Visualization of the optimal transport prototype assignment (OTPA) algorithm. We visualize the normalized cosine similarity matrix $\frac{\boldsymbol{S}+1}{2}$ (the input of the OTPA algorithm, with values in $[0,1]$) and the optimal transport plan $\hat{\boldsymbol{Q}}$ (the output of the OTPA algorithm). The horizontal and vertical coordinates of each visualized matrix denote visible pseudo classes and infrared pseudo classes, respectively.
  • ...and 7 more figures