Unsupervised Visible-Infrared ReID via Pseudo-label Correction and Modality-level Alignment

Yexin Liu, Weiming Zhang, Athanasios V. Vasilakos, Lin Wang

TL;DR

This work tackles unsupervised visible-infrared person re-identification by addressing two core issues: noisy pseudo labels from intra-modality clustering and cross-modality misalignment between visible and infrared features. It introduces PRAISE, a theory-informed framework with two components. Pseudo-Label Correction (PLC) uses a Beta Mixture Model to weight pseudo-label noise in a perceptual-contrastive loss. Modality-level Alignment (MLA) employs bi-directional latent translation and centroid-based sequential filtering matching (SFM), plus CMA and LFC losses, to align modalities and labeling functions. A generalization bound based on $\mathcal{H}$-divergence and empirical Rademacher complexity motivates the dual focus on reducing intra-modality errors and enforcing cross-modal alignment. Empirically, PRAISE achieves state-of-the-art performance among fully unsupervised VI-ReID methods on SYSU-MM01 and RegDB, approaching supervised VI-ReID at higher ranks and offering a practical avenue for cross-modal person re-identification without paired annotations.

Abstract

Unsupervised visible-infrared person re-identification (UVI-ReID) has recently gained great attention due to its potential for enhancing human detection in diverse environments without labeling. Previous methods utilize intra-modality clustering and cross-modality feature matching to achieve UVI-ReID. However, two challenges remain: 1) the clustering process may generate noisy pseudo labels, and 2) cross-modality feature alignment that matches the marginal distributions of the visible and infrared modalities may misalign different identities across the two modalities. In this paper, we first conduct a theoretical analysis in which an interpretable generalization upper bound is introduced. Based on the analysis, we then propose a novel unsupervised cross-modality person re-identification framework (PRAISE). Specifically, to address the first challenge, we propose a pseudo-label correction strategy that utilizes a Beta Mixture Model to predict the probability of mis-clustering based on the network's memory effect and rectifies the correspondence by adding a perceptual term to contrastive learning. Next, to address the second challenge, we introduce a modality-level alignment strategy that generates paired visible-infrared latent features and reduces the modality gap by aligning the labeling functions of visible and infrared features, so as to learn identity-discriminative and modality-invariant features. Experimental results on two benchmark datasets demonstrate that our method achieves state-of-the-art performance compared with the unsupervised visible-ReID methods.
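
To make the Beta Mixture Model step more concrete, the following is a minimal sketch of how a two-component Beta mixture could be fitted to per-sample scores to estimate a probability of mis-clustering. It is not the authors' implementation. Several things here are assumptions for illustration: the use of per-sample training losses as input (a common way to exploit the memorization effect, where cleanly labeled samples tend to be fitted earlier and show lower loss), the EM fit with a weighted method-of-moments update, and the hypothetical names `beta_pdf` and `mislabel_probability`.

```python
import math
import random


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed in log space."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm)


def mislabel_probability(losses, n_iter=30, eps=1e-4):
    """Fit a two-component Beta mixture with EM and return, for each sample,
    the posterior probability of the high-loss ("noisy") component.

    Assumption: `losses` are per-sample training losses; the abstract does
    not state which statistic the paper feeds to its mixture model.
    """
    # Min-max normalise into (0, 1); clip to avoid log(0).
    lo, hi = min(losses), max(losses)
    x = [min(max((v - lo) / (hi - lo + 1e-12), eps), 1 - eps) for v in losses]

    params = [(2.0, 5.0), (5.0, 2.0)]  # initial (a, b): low-loss, high-loss
    weights = [0.5, 0.5]

    def responsibilities():
        resp = []
        for xi in x:
            p = [weights[k] * beta_pdf(xi, *params[k]) for k in range(2)]
            total = sum(p) + 1e-300
            resp.append([pk / total for pk in p])
        return resp

    for _ in range(n_iter):
        resp = responsibilities()  # E-step
        for k in range(2):  # M-step: weighted method of moments
            rk = [r[k] for r in resp]
            mass = sum(rk) + 1e-12
            mean = sum(r * xi for r, xi in zip(rk, x)) / mass
            var = sum(r * (xi - mean) ** 2 for r, xi in zip(rk, x)) / mass + 1e-8
            scale = max(mean * (1 - mean) / var - 1, 1e-3)
            params[k] = (max(mean * scale, 1e-3), max((1 - mean) * scale, 1e-3))
            weights[k] = mass / len(x)

    # The noisy component is the one with the larger mean a / (a + b).
    means = [a / (a + b) for a, b in params]
    noisy = 0 if means[0] > means[1] else 1
    return [r[noisy] for r in responsibilities()]


# Synthetic demo: 300 low-loss ("clean") and 100 high-loss ("noisy") samples.
random.seed(0)
clean = [random.betavariate(2, 10) for _ in range(300)]
noisy = [random.betavariate(10, 2) for _ in range(100)]
probs = mislabel_probability(clean + noisy)
print(sum(probs[:300]) / 300, sum(probs[300:]) / 100)
```

One plausible use of such probabilities is to down-weight suspect samples in a contrastive objective, for example with a weight of one minus the estimated probability. How PRAISE actually combines them with its perceptual term is not specified in the abstract, so that weighting is only an illustration.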

Paper Structure (15 sections, 12 equations, 6 figures, 5 tables)

This paper contains 15 sections, 12 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Illustration of the two proposed strategies: (a) pseudo-label correction, (b) modality-level alignment.
  • Figure 2: Overview of our method. The encoder extracts the features of visible and infrared images. Then, the extracted features in one modality are translated into the other modality. Our method contains two key strategies: pseudo-label correction (PLC) and modality-level alignment (MLA). PLC enables us to deal with noisy pseudo labels and learn more robust discriminative features. MLA enables us to reduce modality discrepancies.
  • Figure 3: The structure of the proposed sequential filtering matching.
  • Figure 4: The illustration of the proposed MLA losses. We apply $\mathcal{L}_{cma}$ for cross-modal matching based on feature matching results and clustering centers. Additionally, $\mathcal{L}_{lfc}$ is used to prevent incorrect grouping of different human features from the two modalities.
  • Figure 5: t-SNE visualization of features before MLA (left) and after MLA (right) in the same batch.
  • ...and 1 more figures