X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification

Chenyang Yu, Xuehu Liu, Pingping Zhang, Huchuan Lu

TL;DR

X-ReID addresses the VVI-ReID problem by combining Cross-modality Prototype Collaboration (CPC) with Multi-granularity Information Interaction (MII). CPC leverages a CLIP-based memory of identity prototypes and cross-modal updates to reduce the modality gap, while MII captures short-term and long-term temporal information and performs cross-modality feature interaction, enforced by a Cross-Modality Constraint Loss. Training minimizes $L_{total}=L_{CPCL}+L_{tri}+L_{ce}+L_{CMCL}$; at inference, the cross-modality interaction (CII) is excluded for efficiency and the multi-scale temporal features are concatenated. Experiments on HITSZ-VCM and BUPTCampus show state-of-the-art performance, supporting both the modality alignment and the temporal modeling components; a public implementation is available.
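As a rough illustration of the two operations named in the TL;DR, the sketch below sums the four loss terms without weights, as $L_{total}$ is written above, and concatenates two temporal feature vectors the way an inference-time descriptor might be formed. The function names `total_loss` and `inference_descriptor`, the plain-float and plain-list inputs, and the restriction to two temporal scales are assumptions made for illustration; they are not taken from the released code.

```python
from typing import List, Sequence


def total_loss(l_cpcl: float, l_tri: float, l_ce: float, l_cmcl: float) -> float:
    """Unweighted sum of the four training terms, mirroring L_total in the TL;DR."""
    return l_cpcl + l_tri + l_ce + l_cmcl


def inference_descriptor(short_term: Sequence[float],
                         long_term: Sequence[float]) -> List[float]:
    """Concatenate temporal features of different granularity.

    Hypothetical helper: no cross-modality interaction (CII) output is
    involved, matching the statement that CII is excluded at inference.
    """
    return list(short_term) + list(long_term)


# Toy values only; real loss terms and features come from the network.
loss = total_loss(0.5, 0.25, 1.0, 0.25)
desc = inference_descriptor([0.1, 0.2], [0.3, 0.4, 0.5])
```

In a real training loop the four terms would be tensors produced by the model, and the concatenated descriptor would be the sequence-level feature compared between query and gallery.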

Abstract

Large-scale vision-language models (e.g., CLIP) have recently achieved remarkable performance in retrieval tasks, yet their potential for Video-based Visible-Infrared Person Re-Identification (VVI-ReID) remains largely unexplored. The primary challenges are narrowing the modality gap and leveraging spatiotemporal information in video sequences. To address the above issues, in this paper, we propose a novel cross-modality feature learning framework named X-ReID for VVI-ReID. Specifically, we first propose a Cross-modality Prototype Collaboration (CPC) to align and integrate features from different modalities, guiding the network to reduce the modality discrepancy. Then, a Multi-granularity Information Interaction (MII) is designed, incorporating short-term interactions from adjacent frames, long-term cross-frame information fusion, and cross-modality feature alignment to enhance temporal modeling and further reduce modality gaps. Finally, by integrating multi-granularity information, a robust sequence-level representation is achieved. Extensive experiments on two large-scale VVI-ReID benchmarks (i.e., HITSZ-VCM and BUPTCampus) demonstrate the superiority of our method over state-of-the-art methods. The source code is released at https://github.com/AsuradaYuci/X-ReID.

Paper Structure

This paper contains 16 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Illustration of our motivations.
  • Figure 2: Illustration of the proposed X-ReID framework.
  • Figure 3: Illustration of our MII.
  • Figure 4: Illustration of the impact of time stride $S$ in SII on HITSZ-VCM under the I2V setting.
  • Figure 5: Illustration of the impact of time stride $S$ in LII on HITSZ-VCM under the I2V setting.