
ICPL-ReID: Identity-Conditional Prompt Learning for Multi-Spectral Object Re-Identification

Shihao Li, Chenglong Li, Aihua Zheng, Jin Tang, Bin Luo

TL;DR

ICPL-ReID leverages CLIP with online, identity-conditional prompt learning to bridge RGB, NIR, and TIR for robust multi-spectral ReID. The framework introduces an identity-prototype guided alignment loop and a lightweight multi-spectral adapter to adapt spectral features while preserving the pre-trained image–text alignment. Through online prompts and spectral prototypes, the method achieves state-of-the-art results across five benchmarks, reducing reliance on heavy spectral fusion modules. The approach demonstrates effective cross-modal semantic transfer, improved attention to discriminative regions, and strong generalization under challenging illumination and weather conditions.
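The identity-prototype guidance can be illustrated with a minimal NumPy sketch. It assumes that a prototype is the L2-normalized mean of one identity's instance features within a single spectrum, and that the prompt-prototype terms $\mathcal{L}_{t2p} + \mathcal{L}_{p2t}$ take a symmetric CLIP-style contrastive form with temperature `tau`. The function names, the averaging rule, and `tau=0.07` are illustrative assumptions rather than the paper's exact formulation, and the dynamic prototype update strategy is not modeled here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def identity_prototypes(features, labels):
    """Average each identity's instance features into one prototype.

    features: (N, D) instance features of one spectrum (e.g. NIR).
    labels:   (N,) integer identity labels.
    Returns the sorted unique identities and a (K, D) array of unit prototypes.
    """
    ids = np.unique(labels)
    protos = np.stack([features[labels == i].mean(axis=0) for i in ids])
    return ids, l2_normalize(protos)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def prompt_prototype_loss(text_feats, protos, tau=0.07):
    """Hypothetical symmetric loss L_t2p + L_p2t.

    Row k of `text_feats` (the learnable prompt embedding of identity k) and
    row k of `protos` describe the same identity; all other rows are negatives.
    """
    sim = l2_normalize(text_feats) @ protos.T / tau  # (K, K) cosine / tau
    k = np.arange(sim.shape[0])
    loss_t2p = -log_softmax(sim, axis=1)[k, k].mean()  # prompt -> its prototype
    loss_p2t = -log_softmax(sim, axis=0)[k, k].mean()  # prototype -> its prompt
    return loss_t2p + loss_p2t
```

Passing the prototypes themselves as prompts yields a lower loss than a shuffled identity assignment, which is the behavior such an alignment term needs: each prompt is pulled toward its own identity's prototype and pushed away from the others.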

Abstract

Multi-spectral object re-identification (ReID) brings a new perception perspective to smart-city and intelligent-transportation applications, effectively addressing the challenges posed by complex illumination and adverse weather. However, the complex modality differences between heterogeneous spectra make it difficult to efficiently exploit both the complementary and the discrepant information across spectra. Most existing methods fuse spectral data through intricate modal interaction modules and lack a fine-grained semantic understanding of spectral information (e.g., text descriptions, part masks, and object keypoints). To address this, we propose a novel Identity-Conditional text Prompt Learning framework (ICPL), which exploits the powerful cross-modal alignment capability of CLIP to unify the visual features of different spectra through text semantics. Specifically, we first propose online prompt learning, which uses a learnable text prompt as the identity-level semantic center to bridge the identity semantics of different spectra in an online manner. Then, in the absence of concrete text descriptions, we propose a multi-spectral identity-condition module that uses identity prototypes as spectral identity conditions to constrain prompt learning. Meanwhile, we construct an alignment loop that mutually optimizes the learnable text prompt and the spectral visual encoder, preventing online prompt learning from disrupting the pre-trained text-image alignment distribution. In addition, to adapt to small-scale multi-spectral data and mitigate style differences between spectra, we propose a multi-spectral adapter that employs a low-rank adaptation method to learn spectrum-specific features. Comprehensive experiments on five benchmarks (RGBNT201, Market-MM, MSVR310, RGBN300, and RGBNT100) demonstrate that the proposed method outperforms state-of-the-art methods.
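The low-rank multi-spectral adapter can be sketched in NumPy as follows, assuming a standard LoRA update on a single frozen, shared projection with one trainable pair $(A_m, B_m)$ per spectrum. The class name, `rank`, `alpha`, and the placement on one linear layer are hypothetical; the paper's adapter architecture (Figure 4) may differ in where and how the low-rank branches are attached.

```python
import numpy as np


class MultiSpectralLoRA:
    """Frozen shared projection plus one trainable low-rank update per spectrum.

    Hypothetical sketch: y = x W^T + (alpha / r) * (x A_m^T) B_m^T, where W is
    shared by all spectra and frozen, and (A_m, B_m) belong to spectrum m.
    """

    def __init__(self, weight, spectra=("RGB", "NIR", "TIR"), rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = weight  # (d_out, d_in), frozen and shared
        d_out, d_in = weight.shape
        self.scale = alpha / rank
        # Usual LoRA initialization: A small random, B zero, so each adapter
        # starts as a no-op and the pre-trained output is preserved.
        self.A = {m: rng.normal(0.0, 0.02, size=(rank, d_in)) for m in spectra}
        self.B = {m: np.zeros((d_out, rank)) for m in spectra}

    def __call__(self, x, spectrum):
        """Project x of shape (N, d_in) with the adapter of the given spectrum."""
        base = x @ self.weight.T
        delta = (x @ self.A[spectrum].T) @ self.B[spectrum].T
        return base + self.scale * delta

    def trainable_params(self):
        """Count adapter parameters (the frozen W is excluded)."""
        return sum(self.A[m].size + self.B[m].size for m in self.A)
```

Because each $B_m$ starts at zero, every spectrum initially reproduces the frozen projection exactly, and only $r(d_{in} + d_{out})$ parameters per spectrum are trained. For a 768 x 768 projection with rank 4, the three adapters add 18,432 trainable parameters next to 589,824 frozen ones.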

Paper Structure

This paper contains 18 sections, 12 equations, 9 figures, 10 tables, and 1 algorithm.

Figures (9)

  • Figure 1: (a) Classical pre-trained models require ReID fine-tuning with ID loss and triplet loss [DBLP:conf/cvpr/0004GLL019, DBLP:conf/iccv/He0WW0021]. (b) Existing research introduces two-stage text prompt learning [DBLP:conf/aaai/LiSL23], which first pre-aligns a text prompt for each identity and then fine-tunes the ReID task with the text prompts in a separate stage. (c) Our method proposes an end-to-end text prompt learning framework that seamlessly integrates text prompt learning with the multi-spectral ReID task and alleviates the discrepancies between multi-spectral data.
  • Figure 2: Pipeline of our proposed framework. (a) For end-to-end training of multi-spectral ReID, online prompt learning uses learnable text prompts as cross-modal constraints to jointly optimize the ReID task. (b) The multi-spectral identity (id)-condition module first aggregates the RGB, NIR, and TIR spectral features into identity prototypes, and replaces instance features with prototypes to guide text prompt learning. The dynamic update strategy enables the alignment of spectrum-specific semantic features during training. (c) The alignment loop enables mutual optimization between the text prompts and the spectral encoder. It consists of $\mathcal{L}_{prompt}$ and $\mathcal{L}_{i2p}$, where $\mathcal{L}_{t2p} + \mathcal{L}_{p2t}$ within $\mathcal{L}_{prompt}$ helps the text prompt $T_{m}$ learn the semantic information of the spectral identity prototype $u_m$. Meanwhile, the image-to-text alignment loss $\mathcal{L}_{i2t}$ guides the spectral instance $v_m$ by leveraging text semantics without requiring concrete text descriptions. (d) The multi-spectral encoder freezes and shares most parameters of CLIP's visual branch across spectral modalities, adding a low-rank adapter to adapt to spectrum-specific data.
  • Figure 3: Illustration of instance feature learning strategies. (a) The classic ReID metric learning method employs $\mathcal{L}_{id}$ and $\mathcal{L}_{tri}$ to enhance intra-class compactness and inter-class separability within each spectrum. (b) Cross-modal semantic alignment between text and spectra is typically achieved by constructing a latent text-image alignment space with symmetric $\mathcal{L}_{i2t}$ and $\mathcal{L}_{t2i}$ losses. (c) Prototype-guided learning brings spectral features closer to the identity prototype and enhances the perception of global sample features within each spectral instance.
  • Figure 4: Architecture of our multi-spectral adapter.
  • Figure 5: The performance trend on mAP and Rank-1 as the number of tunable parameters grows.
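The captions describe an image-to-text loss $\mathcal{L}_{i2t}$ that guides spectral instance features $v_m$ with the learned prompts. One plausible form, taken here as an assumption from the two-stage prompt-learning approach cited in Figure 1(b), treats the prompt embeddings of all identities as classifier weights and applies cross-entropy over them. The NumPy sketch below uses that form with a hypothetical function name and temperature `tau`; it is not the paper's verified loss.

```python
import numpy as np


def image_to_text_loss(img_feats, labels, text_feats, tau=0.07):
    """Hypothetical L_i2t: classify each instance against all identity prompts.

    img_feats:  (N, D) instance features v_m of one spectrum.
    labels:     (N,) identity indices into the rows of text_feats.
    text_feats: (K, D) one learnable-prompt embedding per identity.
    """
    v = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T / tau  # (N, K) cosine similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Mean negative log-probability of each instance's own identity prompt.
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Unlike a prototype-level term, this loss acts on individual instances, so every spectral feature receives a gradient from the text side even though no per-image captions exist.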