CLIP-Driven Semantic Discovery Network for Visible-Infrared Person Re-Identification

Xiaoyan Yu, Neng Dong, Liehuang Zhu, Hao Peng, Dapeng Tao

TL;DR

This work tackles the modality gap in visible-infrared person re-identification (VIReID) by injecting high-level semantic information derived from CLIP into visual representations. It introduces a CLIP-Driven Semantic Discovery Network (CSDN) with Modality-specific Prompt Learners, Semantic Information Integration, and High-level Semantic Embedding to align cross-modality features without image generation. Ablation studies and comparative experiments on SYSU-MM01 and RegDB show that both CLIP-VIReID (the authors' direct application of CLIP to VIReID) and the full CSDN outperform state-of-the-art non-generative VIReID approaches, with the best performance obtained when modality-specific prompts are fused via attention. The approach offers a practical, scalable path toward robust VIReID in real-world, around-the-clock surveillance scenarios, and the authors suggest future work on leveraging larger multimodal models.

Abstract

Visible-infrared person re-identification (VIReID) primarily deals with matching identities across person images from different modalities. Due to the modality gap between visible and infrared images, cross-modality identity matching poses significant challenges. Recognizing that high-level semantics of pedestrian appearance, such as gender, shape, and clothing style, remain consistent across modalities, this paper aims to bridge the modality gap by infusing visual features with high-level semantics. Given the capability of CLIP to sense high-level semantic information corresponding to visual representations, we explore the application of CLIP within the domain of VIReID. Consequently, we propose a CLIP-Driven Semantic Discovery Network (CSDN) that consists of a Modality-specific Prompt Learner (MsPL), Semantic Information Integration (SII), and High-level Semantic Embedding (HSE). Specifically, considering the diversity stemming from modality discrepancies in language descriptions, we devise bimodal learnable text tokens to capture modality-private semantic information for visible and infrared images, respectively. Additionally, acknowledging the complementary nature of semantic details across different modalities, we integrate text features from the bimodal language descriptions to achieve comprehensive semantics. Finally, we establish a connection between the integrated text features and the visual features across modalities. This process embeds rich high-level semantic information into visual representations, thereby promoting the modality invariance of visual representations. The effectiveness and superiority of our proposed CSDN over existing methods have been substantiated through experimental evaluations on multiple widely used benchmarks. The code will be released at https://github.com/nengdong96/CSDN.
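
The last two steps of the abstract (integrating the bimodal text features, then connecting the integrated feature to the visual features) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the fusion or the loss, so the sketch assumes scaled dot-product attention for the fusion and a CLIP-style cross-entropy over cosine similarities for the alignment. The function names, the temperature value of 0.07, and the toy vectors are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_text_features(t_vis, t_ir):
    """Assumed stand-in for SII: each modality's text feature attends over
    both text features, and the two attended outputs are averaged."""
    feats = [t_vis, t_ir]
    scale = math.sqrt(len(t_vis))
    outputs = []
    for query in feats:
        weights = softmax([dot(query, key) / scale for key in feats])
        outputs.append([sum(w * f[i] for w, f in zip(weights, feats))
                        for i in range(len(query))])
    return [(a + b) / 2 for a, b in zip(*outputs)]


def alignment_loss(image_feat, text_bank, label, temperature=0.07):
    """Assumed stand-in for HSE: cross-entropy of the image feature against
    one fused text feature per identity, using cosine-similarity logits."""
    img = l2_normalize(image_feat)
    logits = [dot(img, l2_normalize(t)) / temperature for t in text_bank]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]  # equals -log softmax(logits)[label]


# Toy data: two identities, each with a visible and an infrared text feature.
text_bank = [
    fuse_text_features([1.0, 0.0, 0.2], [0.8, 0.1, 0.0]),  # identity 0
    fuse_text_features([0.0, 1.0, 0.1], [0.1, 0.9, 0.0]),  # identity 1
]
visible_image_id0 = [0.9, 0.1, 0.3]
infrared_image_id0 = [0.7, 0.2, -0.1]
loss = (alignment_loss(visible_image_id0, text_bank, 0)
        + alignment_loss(infrared_image_id0, text_bank, 0))
print(f"toy alignment loss for identity 0: {loss:.4f}")
```

Because the visible and infrared images of one identity are both pulled toward the same fused text feature, minimizing such a loss would draw the two visual features toward each other, which is the sense in which text acts as a bridge between modalities.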

Paper Structure

This paper contains 35 sections, 16 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: The core motivation of this paper. The image features of visible and infrared modalities exhibit significant modality discrepancies (see (a)), while their corresponding text features reveal no such disparities (see (b)). This suggests that textual features can serve as a bridge for aligning visual representations across modalities.
  • Figure 2: Overview of the proposed method. The entire learning process of our CSDN includes three stages. Stage 1 (MsPL): Given image samples across modalities, we design bimodal prompt learners to generate natural language descriptions corresponding to visible and infrared images respectively. Stage 2 (SII): Considering the semantic complementarity of descriptions in different modalities, we devise an attention fusion module to integrate their semantic details. Stage 3 (HSE): With the guidance of the integrated rich complementary semantics, we inject the semantic information into visual representations of visible and infrared images, promoting their modality invariance.
  • Figure 3: Our approach to applying CLIP to VIReID, which we name CLIP-VIReID. Specifically, to harness CLIP’s potent capabilities, we build a learnable language description to acquire semantic information for pairs of cross-modality images. Subsequently, we employ the obtained semantics to establish connections between visual representations across different modalities.
  • Figure 4: Visualization of spatial discriminative regions. The images are arranged from left to right in the following order: original image, heatmap obtained by Baseline, CLIP Pre-trained, CLIP-VIReID, CSDN#1, CSDN#2, and CSDN#3.
  • Figure 5: Effect of the hyper-parameters $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$. Rank-1 accuracy and mAP are reported. Note that when one hyper-parameter is analyzed, the remaining two are fixed at their optimal values.
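
The protocol in the Figure 5 caption (vary one hyper-parameter while the other two stay fixed at their optimal values) amounts to a one-at-a-time sweep, sketched below. The caption does not give the grids, the optimal values, or the training code, so `toy_evaluate`, the grid, and the `best` values are hypothetical placeholders rather than the paper's settings or results.

```python
def one_at_a_time_sweep(evaluate, best, grid):
    """Vary each hyper-parameter over its grid while holding the others at
    the values in `best`; returns {name: [(value, score), ...]}."""
    results = {}
    for name, values in grid.items():
        results[name] = []
        for value in values:
            config = dict(best, **{name: value})  # override one entry only
            results[name].append((value, evaluate(config)))
    return results


def toy_evaluate(config):
    # Hypothetical stand-in for "train the model, then measure Rank-1/mAP".
    # It returns an arbitrary score that peaks when every weight is 0.5.
    return -sum((v - 0.5) ** 2 for v in config.values())


# Placeholder values, chosen only so that the toy score has a clear peak.
best = {"lambda1": 0.5, "lambda2": 0.5, "lambda3": 0.5}
grid = {name: [0.1, 0.3, 0.5, 0.7, 0.9] for name in best}

sweep = one_at_a_time_sweep(toy_evaluate, best, grid)
for name, curve in sweep.items():
    top_value, _ = max(curve, key=lambda pair: pair[1])
    print(name, "best value in toy sweep:", top_value)
```

A sweep of this kind needs only one run per grid point per hyper-parameter instead of one run per combination, at the cost of not probing interactions between the three weights.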