Embedding and Enriching Explicit Semantics for Visible-Infrared Person Re-Identification

Neng Dong, Shuanglin Yan, Liyan Zhang, Jinhui Tang

TL;DR

This work tackles visible-infrared person re-identification by moving beyond image-only features to embeddings enriched with explicit semantics. The proposed EEES framework combines Explicit Semantics Embedding, Cross-View Semantics Compensation, and Cross-Modality Semantics Purification to align cross-modality data, fuse multi-view information, and suppress noisy semantics, all in an end-to-end trainable system. By leveraging LLaVA-generated descriptions and CLIP-based alignment, plus multi-view knowledge distillation, EEES achieves state-of-the-art performance on SYSU-MM01 and RegDB, demonstrating significant gains over both generative-based and generative-free VIReID methods. This approach offers a practical path toward more robust, semantically grounded cross-modal re-identification with strong potential for real-world surveillance applications.

Abstract

Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities. Existing methods learn visual content solely from images, lacking the capability to sense high-level semantics. In this paper, we propose an Embedding and Enriching Explicit Semantics (EEES) framework to learn semantically rich cross-modality pedestrian representations. Our method offers several contributions. First, with the collaboration of multiple large language-vision models, we develop Explicit Semantics Embedding (ESE), which automatically supplements language descriptions for pedestrians and aligns image-text pairs into a common space, thereby learning visual content associated with explicit semantics. Second, recognizing the complementarity of multi-view information, we present Cross-View Semantics Compensation (CVSC), which constructs multi-view image-text pair representations, establishes their many-to-many matching, and propagates knowledge to single-view representations, thus compensating visual content with its missing cross-view semantics. Third, to eliminate noisy semantics such as conflicting color attributes in different modalities, we design Cross-Modality Semantics Purification (CMSP), which constrains the distance between inter-modality image-text pair representations to be close to that between intra-modality image-text pair representations, further enhancing the modality-invariance of visual content. Finally, experimental results demonstrate the effectiveness and superiority of the proposed EEES.
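
The CMSP constraint described above (inter-modality image-text distances pulled toward intra-modality ones) can be illustrated with a small sketch. This is not the paper's loss: the abstract does not give the formula, so the cosine distance, the absolute-difference penalty, and the function names `pair_distance` and `purification_loss` are all assumptions made for illustration. Row `i` of every array is assumed to belong to the same identity.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def pair_distance(img, txt):
    """Cosine distance between each image feature and the text feature in the same row."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    return 1.0 - np.sum(img * txt, axis=-1)

def purification_loss(vis_img, vis_txt, ir_img, ir_txt):
    """Hypothetical CMSP-style penalty (assumed form, not taken from the paper).

    Penalizes the gap between inter-modality and intra-modality
    image-text distances, averaged over the batch.
    """
    # Intra-modality: each image against the text generated from its own modality.
    intra = 0.5 * (pair_distance(vis_img, vis_txt) + pair_distance(ir_img, ir_txt))
    # Inter-modality: each image against the text generated from the other modality.
    inter = 0.5 * (pair_distance(vis_img, ir_txt) + pair_distance(ir_img, vis_txt))
    return float(np.mean(np.abs(inter - intra)))
```

Under this assumed form, the penalty vanishes when the two modalities yield identical image and text features, and grows when modality-specific cues such as color attributes make the cross-modality pairs disagree more than the same-modality pairs do.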

Paper Structure

This paper contains 21 sections, 13 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The core motivation of our EEES framework arises from several key observations: (a) Language descriptions produced by off-the-shelf large language-vision generation models surpass learnable prompts in clarity and detail. (b) Multi-view images/texts exhibit significant complementary attributes. (c) Noisy information such as color cues leads to semantic conflicts between paired cross-modality images.
  • Figure 2: Overview of our EEES. It comprises ESE, CVSC, and CMSP. ESE supplements language descriptions for images and aligns image-text pairs into a common space. CVSC fuses image (text) features with the same identity across different views, establishes correspondences between multi-view image-text pair representations, and transfers knowledge from multi-view representations to single-view ones. CMSP constrains the distance between inter-modality image-text pair representations to be close to that of intra-modality image-text pair representations. During inference, only the visual side is used.
  • Figure 3: Parameter analysis of $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, and $\lambda_{4}$.
  • Figure 4: Visualization of spatial discriminative regions. From left to right, the images are arranged as follows: the original image, followed by heatmaps of Baseline, ISE, ESE, ESE+CVSC, ESE+CMSP, and EEES.
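
The CVSC step summarized in the Figure 2 caption (fuse same-identity features across views, then transfer knowledge to single-view features) can be sketched as follows. The fusion operator and the distillation objective are not specified in this section, so mean pooling and a mean-squared-error penalty are used here purely as stand-ins, and the names `fuse_by_identity` and `distillation_loss` are hypothetical.

```python
import numpy as np

def fuse_by_identity(features, labels):
    """Average all features sharing an identity label (assumed fusion operator)."""
    ids = np.unique(labels)  # sorted unique identities
    fused = np.stack([features[labels == i].mean(axis=0) for i in ids])
    return ids, fused

def distillation_loss(features, labels):
    """Assumed MSE between each single-view feature and its identity's fused feature."""
    ids, fused = fuse_by_identity(features, labels)
    # Map every sample to the row of its identity in `fused`.
    index = np.searchsorted(ids, labels)
    teacher = fused[index]
    return float(np.mean((features - teacher) ** 2))

# Toy batch: two views of identity 0 and one view of identity 1.
features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 0, 1])
ids, fused = fuse_by_identity(features, labels)
loss = distillation_loss(features, labels)
```

In this toy batch the fused feature for identity 0 is the midpoint of its two views, so each of those views is pulled toward information it lacks, while the lone view of identity 1 already equals its fused feature and contributes nothing to the loss.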