Enhancing Visible-Infrared Person Re-identification with Modality- and Instance-aware Visual Prompt Learning

Ruiqi Wu, Bingliang Jiao, Wenxuan Wang, Meng Liu, Peng Wang

TL;DR

This work tackles the VI ReID challenge by leveraging both modality-invariant and modality-specific cues. It introduces MIP, a vision-transformer-based network augmented with Modality-aware Prompt Learning (MPL) and an Instance-aware Prompt Generator (IPG) to adapt to different modalities and individual identities, reinforced by an Instance-aware Enhancement Loss (IAEL). The proposed framework outperforms most state-of-the-art methods on SYSU-MM01 and RegDB, suggesting that modality-specific and instance-specific prompts can help bridge cross-modality gaps and enhance discriminative power. The approach offers a practical direction for cross-modality person re-identification, with potential extensions to other multi-modal recognition tasks.

Abstract

Visible-Infrared Person Re-identification (VI ReID) aims to match visible and infrared images of the same pedestrians across non-overlapping camera views. These two input modalities contain both invariant information, such as shape, and modality-specific details, such as color. An ideal model should utilize valuable information from both modalities during training for enhanced representational capability. However, the gap caused by modality-specific information makes it hard for a VI ReID model to handle inputs from distinct modalities simultaneously. To address this, we introduce the Modality-aware and Instance-aware Visual Prompts (MIP) network, designed to effectively utilize both invariant and specific information for identification. Specifically, our MIP model is built on the transformer architecture. In this model, we design a series of modality-specific prompts that enable the model to adapt to and make use of the specific information inherent in each input modality, thereby reducing the interference caused by the modality gap and achieving better identification. In addition, we use each pedestrian feature to construct a group of instance-specific prompts. These customized prompts guide the model to adapt to each pedestrian instance dynamically, thereby capturing identity-level discriminative clues for identification. Extensive experiments on the SYSU-MM01 and RegDB datasets evaluate the effectiveness of both designed modules. Additionally, our proposed MIP performs better than most state-of-the-art methods.
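
The idea of modality-specific prompts can be illustrated with a small sketch. This is not the authors' implementation: it only assumes that each modality owns its own set of prompt vectors per transformer layer, and that the prompts matching an image's modality label are prepended to its patch tokens. The names `make_prompt_bank` and `add_modality_prompts`, the toy sizes, and the random initialization are all hypothetical.

```python
import random

EMBED_DIM = 4      # toy embedding size (assumption, not from the paper)
NUM_PROMPTS = 2    # prompts per modality per layer (assumption)
NUM_LAYERS = 3     # toy transformer depth (assumption)

def make_prompt_bank(seed=0):
    """Create one set of prompt vectors per (modality, layer).

    In a real model these would be learnable parameters; here they are
    plain lists filled with small random values.
    """
    rng = random.Random(seed)
    return {
        modality: [
            [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
             for _ in range(NUM_PROMPTS)]
            for _ in range(NUM_LAYERS)
        ]
        for modality in ("visible", "infrared")
    }

def add_modality_prompts(tokens, modality, layer, bank):
    """Prepend the prompts selected by the modality label at a given layer."""
    prompts = bank[modality][layer]
    return prompts + tokens  # new list: prompt tokens first, then patch tokens

bank = make_prompt_bank()
patch_tokens = [[0.0] * EMBED_DIM for _ in range(5)]  # 5 dummy patch embeddings

# The same patch tokens receive different prompts depending on the modality.
vis_in = add_modality_prompts(patch_tokens, "visible", 0, bank)
ir_in = add_modality_prompts(patch_tokens, "infrared", 0, bank)
```

In this sketch the backbone weights would be shared across modalities, and only the prepended prompts differ between visible and infrared inputs.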

Paper Structure (17 sections, 10 equations, 4 figures, 5 tables)

This paper contains 17 sections, 10 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Illustration of our motivation. (a) Our motivation is to utilize modality-specific details to reveal potential relationships. For example, characteristics such as clothing types and materials, discerned from visible images, could influence heat radiation intensity and thereby affect brightness variations in infrared images. In Case 1, uniform brightness is expected because the striped plaid shirt suggests similar materials. Conversely, Case 3 exhibits varied brightness due to the different materials of the T-shirt and its logo. (b) Unlike traditional methods, which emphasize modality-invariant information while overlooking modality-specific information, our approach integrates modality-specific attributes like color, texture, and brightness to extract and explore potential relationships.
  • Figure 2: The overall framework of our proposed MIP network, which consists of a backbone model and two major modules. (a) A pre-trained vision transformer (Dosovitskiy et al., 2020) is used as the backbone model. (b) The Modality-aware Prompts Learning (MPL) module produces modality-specific prompts for the input visual embeddings of each layer according to the modality labels of the input images. (c) The Instance-aware Prompts Generator (IPG) module generates instance-specific prompts, which are supervised by our proposed Instance-aware Enhancement Loss (the "IAEL Loss"). The two kinds of prompts help the backbone network adapt to different modality and instance inputs.
  • Figure 3: t-SNE visualization of prompts from the fusion-based and generation-based IPG modules. Different colors represent distinct identities. (a) Fusion-based IPG prompts cluster closely, with less obvious boundaries between individuals, indicating weaker instance-aware ability. (b) Generation-based IPG prompts show larger distances between individuals, reflecting stronger instance-aware ability, which is crucial for effective adaptation to diverse instances.
  • Figure 4: Visualization of the attention maps of our MPL module and the baseline model. The second column in each case shows that the baseline model tends to capture only the explicit correspondence between inputs of different modalities, for example focusing only on the upper dress part while ignoring the skirt part in case (a). In contrast, our MPL module, with its carefully designed modality-specific prompts, can effectively adapt to and make use of modality-specific information. This enables the MPL model to explore and capture the implicit correspondence between the skirt parts. Case (b) shows a similar result.
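
The Figure 2 caption says the IPG module generates instance-specific prompts from the input. A minimal sketch of a generation-style prompt generator follows. It assumes, purely for illustration, that each prompt is a linear projection of one pedestrian feature vector; the actual IPG design and the IAEL supervision are not described in this summary and are not reproduced here. The names `make_generator` and `generate_instance_prompts` and all sizes are hypothetical.

```python
import random

def make_generator(feature_dim, embed_dim, num_prompts, seed=0):
    """Build one (embed_dim x feature_dim) weight matrix per generated prompt.

    The weights are random stand-ins for parameters that would be learned.
    """
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(feature_dim)]
         for _ in range(embed_dim)]
        for _ in range(num_prompts)
    ]

def generate_instance_prompts(feature, generator):
    """Map one pedestrian feature to a list of prompt vectors.

    Each prompt is the matrix-vector product of one weight matrix with
    the feature, so different pedestrians yield different prompts.
    """
    return [
        [sum(w * x for w, x in zip(row, feature)) for row in weight]
        for weight in generator
    ]

generator = make_generator(feature_dim=3, embed_dim=4, num_prompts=2)
person_a = [1.0, 0.0, 0.5]   # dummy feature of one pedestrian image
person_b = [0.0, 1.0, 0.2]   # dummy feature of another pedestrian image

prompts_a = generate_instance_prompts(person_a, generator)
prompts_b = generate_instance_prompts(person_b, generator)
```

Because the prompts are a function of the input feature rather than fixed parameters, they change from one pedestrian to the next, which is the sense of "instance-aware" used in the captions above.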