WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification

Yonggan Wu, Ling-Chao Meng, Yuan Zichao, Sixian Chan, Hong-Qiang Wang

TL;DR

This paper tackles cross-modality gaps in visible–infrared person re-identification by introducing WRIM-Net, a framework that mines wide-ranging information through a Multi-dimension Interactive Information Mining (MIIM) module and an Auxiliary-Information-based Contrastive Learning (AICL) approach. MIIM enables non-local spatial and channel interactions, with separate modules in shallow layers for specific-modality information and a shared module in deeper layers for shared-modality information, boosted by Global Region Interaction. AICL leverages Cross-Modality Key-Instance Contrastive (CMKIC) loss to pull same-ID samples across modalities closer while challenging the model with top-K difficult positives, supplemented by auxiliary information from earlier blocks. The method achieves state-of-the-art results on SYSU-MM01, RegDB, and LLCM, demonstrating strong improvements in cross-modality invariant feature learning and practical VI-ReID performance.

Abstract

For the visible-infrared person re-identification (VI-ReID) task, one of the primary challenges lies in the significant cross-modality discrepancy. Existing methods struggle to conduct modality-invariant information mining. They often focus solely on mining a single dimension, such as spatial or channel, and overlook the extraction of specific-modality multi-dimension information. To fully mine modality-invariant information across a wide range, we introduce the Wide-Ranging Information Mining Network (WRIM-Net), which mainly comprises a Multi-dimension Interactive Information Mining (MIIM) module and an Auxiliary-Information-based Contrastive Learning (AICL) approach. Empowered by the proposed Global Region Interaction (GRI), MIIM comprehensively mines non-local spatial and channel information through intra-dimension interaction. Moreover, thanks to its low computational complexity, separate MIIMs can be positioned in shallow layers, enabling the network to better mine specific-modality multi-dimension information. AICL, by introducing the novel Cross-Modality Key-Instance Contrastive (CMKIC) loss, effectively guides the network in extracting modality-invariant information. We conduct extensive experiments not only on the well-known SYSU-MM01 and RegDB datasets but also on the latest large-scale cross-modality LLCM dataset. The results demonstrate WRIM-Net's superiority over state-of-the-art methods.
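
The TL;DR describes the CMKIC loss as pulling same-ID samples across modalities closer while focusing on the top-K most difficult positives. This summary does not give the exact formulation, so the following is only an illustrative sketch of that general idea, not the paper's loss. The function name `topk_hard_positive_loss`, the cosine similarity, the InfoNCE-style form, and the parameters `k` and `temperature` are all assumptions introduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def topk_hard_positive_loss(anchors, anchor_ids, gallery, gallery_ids,
                            k=2, temperature=0.1):
    """Hypothetical cross-modality contrastive loss (assumed form).

    For each anchor (one modality), the k LEAST similar same-ID samples
    from the other modality are treated as the hard positives, and all
    different-ID samples from the other modality act as negatives.
    """
    losses = []
    for feat, pid in zip(anchors, anchor_ids):
        sims = [cosine(feat, g) / temperature for g in gallery]
        # Ascending sort: the lowest similarities are the hardest positives.
        positives = sorted(s for s, gid in zip(sims, gallery_ids) if gid == pid)[:k]
        negatives = [s for s, gid in zip(sims, gallery_ids) if gid != pid]
        neg_sum = sum(math.exp(s) for s in negatives)
        for p in positives:
            losses.append(-math.log(math.exp(p) / (math.exp(p) + neg_sum)))
    return sum(losses) / len(losses) if losses else 0.0

# Toy features: two visible anchors and three infrared gallery samples.
visible = [[1.0, 0.0], [0.0, 1.0]]
infrared = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]

aligned = topk_hard_positive_loss(visible, [0, 1], infrared, [0, 0, 1])
swapped = topk_hard_positive_loss(visible, [0, 1], infrared, [1, 1, 0])
print(aligned < swapped)
```

In this toy setup the loss is lower when same-ID features from the two modalities already point in similar directions, which is the behaviour a cross-modality contrastive objective is meant to reward.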

Paper Structure

This paper contains 17 sections, 9 equations, 5 figures, and 10 tables.

Figures (5)

  • Figure 1: Example of noticeable modality discrepancy between cross-modality images. The motivation behind the proposed WRIM-Net is to mine modality-invariant information across a wide range (e.g. non-local spatial interaction, channel interaction, specific-modality, shared-modality) and to guide the network in better mining invariant information through a novel cross-modality loss.
  • Figure 2: Framework of WRIM-Net. Two separate MIIMs are inserted after each of the first two blocks of the network and a shared MIIM is inserted after each of the last two blocks of the network. Apart from separate MIIM, all other network parameters are shared. AICL uses the traditional ID loss after Block 3 of the network and CMKIC loss after Block 4 of the network.
  • Figure 3: Diagram of the MIIM module. The input features first pass through a standard batch normalization layer and then through the Spatial-Channel Compress (SCC) component, which reduces their size. The compressed feature is then passed to the Global Region Interaction (GRI) component, which employs Multi-Head Attention (MHA). Finally, the feature weights are restored to the same size as the input features by the Spatial-Channel Restore (SCR) component. $\otimes$ denotes element-wise multiplication.
  • Figure 4: Grad-CAM feature visualization analysis of MIIM. The second column shows the heat map from the baseline network, while the third column shows the heat map with the MIIM module integrated.
  • Figure 5: t-SNE feature visualization. Each color represents an ID, and the circles and triangles represent different modalities. As can be seen, AICL better alleviates the modality discrepancy and improves discriminability.
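
The data flow in the Figure 3 caption (batch normalization, SCC, GRI with multi-head attention, SCR, then element-wise multiplication) can be traced with a small NumPy sketch. This is a shape-level illustration under stated assumptions, not the authors' implementation: average pooling for SCC, nearest-neighbour repetition for SCR, the sigmoid applied to the weights, random matrices standing in for learned projections, and the parameters `region`, `group`, and `heads` are all hypothetical choices made here.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # x: (N, C, H, W); normalize each channel over batch and spatial dims.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def compress(x, region, group):
    # Assumed SCC: average over channel groups and region x region windows.
    n, c, h, w = x.shape
    x = x.reshape(n, c // group, group, h // region, region, w // region, region)
    return x.mean(axis=(2, 4, 6))

def restore(weights, region, group):
    # Assumed SCR: repeat values back to the input's channel/spatial size.
    return weights.repeat(group, axis=1).repeat(region, axis=2).repeat(region, axis=3)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(tokens, heads, rng):
    # tokens: (N, L, D). Random projections stand in for learned weights.
    n, l, d = tokens.shape
    dh = d // heads
    wq, wk, wv, wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

    def split(t):  # (N, L, D) -> (N, heads, L, dh)
        return t.reshape(n, l, heads, dh).transpose(0, 2, 1, 3)

    q, k, v = split(tokens @ wq), split(tokens @ wk), split(tokens @ wv)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(dh))
    out = (attn @ v).transpose(0, 2, 1, 3).reshape(n, l, d)
    return out @ wo

def miim_sketch(x, region=2, group=2, heads=2, seed=0):
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    z = compress(batch_norm(x), region, group)          # (N, C/g, H/r, W/r)
    cz, hz, wz = z.shape[1:]
    tokens = z.reshape(n, cz, hz * wz).transpose(0, 2, 1)  # regions as tokens
    tokens = multi_head_attention(tokens, heads, rng)
    weights = tokens.transpose(0, 2, 1).reshape(n, cz, hz, wz)
    weights = 1.0 / (1.0 + np.exp(-weights))             # assumed sigmoid gate
    return x * restore(weights, region, group)           # element-wise product

x = np.random.default_rng(1).standard_normal((2, 8, 4, 4))
y = miim_sketch(x)
print(y.shape)
```

Because attention runs on the compressed map rather than the full feature map, the number of tokens shrinks by a factor of `region * region`, which is one plausible reading of why such a module can be placed in shallow, high-resolution layers at low cost.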