ShapeSpeak: Body Shape-Aware Textual Alignment for Visible-Infrared Person Re-Identification

Shuanglin Yan, Neng Dong, Shuang Li, Rui Yan, Hao Tang, Jing Qin

TL;DR

This work introduces Propot, an end-to-end prototypical prompting framework for text-to-image person re-identification (TIReID) that jointly optimizes instance-level and identity-level cross-modal matching. Propot generates identity-aware prototypes from CLIP features and refines them through domain- and instance-conditioned prompting (DPP and IPP), followed by adaptive prototype aggregation (APA) to diffuse rich identity information to individual samples via prototype-to-instance contrastive learning and MLM. The approach leverages CLIP as a strong multi-modal prior and demonstrates strong results on three TIReID benchmarks, outperforming many prior methods while maintaining efficiency. The findings highlight the value of explicit identity-level modeling in TIReID and establish a practical, scalable framework for leveraging pre-trained vision-language models in cross-modal re-identification tasks.

Abstract

Visible-Infrared Person Re-identification (VIReID) aims to match visible and infrared pedestrian images, but modality differences and the complexity of identity features make it challenging. Existing methods rely solely on identity label supervision, which makes it difficult to fully extract high-level semantic information. Recently, vision-language pre-trained models have been introduced to VIReID, enhancing semantic information modeling by generating textual descriptions. However, such methods do not explicitly model body shape features, which are crucial for cross-modal matching. To address this, we propose a Body Shape-aware Textual Alignment (BSaTa) framework that explicitly models and utilizes body shape information to improve VIReID performance. Specifically, we design a Body Shape Textual Alignment (BSTA) module that extracts body shape information using a human parsing model and converts it into structured text representations via CLIP. We also design a Text-Visual Consistency Regularizer (TVCR) to ensure alignment between body shape textual representations and visual body shape features. Furthermore, we introduce a Shape-aware Representation Learning (SRL) mechanism that combines Multi-text Supervision and Distribution Consistency Constraints to guide the visual encoder to learn modality-invariant and discriminative identity features. Experimental results demonstrate that our method achieves superior performance on the SYSU-MM01 and RegDB datasets, validating its effectiveness.

Paper Structure

This paper contains 17 sections, 13 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The motivation of our proposed Propot. (a) Some examples of TIReID data containing multiple images from two identities and their annotated texts. Instances under the same identity are significantly different. (b) Most existing TIReID methods only focus on instance-level matching and ignore identity-level matching. (c) Our Propot proposes a prototype prompting framework to produce identity-enriched prototypes and diffuse their rich identity information to instances for modeling identity-level matching.
  • Figure 2: Overview of our Propot. It includes instance-level matching and identity-enriched prototype learning. For instance-level matching, each image and its annotated text are directly aligned through SDM loss (Baseline). For prototype learning, we first utilize pre-trained CLIP to generate the initial prototypes ($\bm{pt}^v$ and $\bm{pt}^t$). We then adapt the initial prototypes to TIReID through the DPP module to generate the task-adapted prototypes ($\bm{p}_a^v$ and $\bm{p}_a^t$). The IPP module then updates the prototypes conditioned on a batch of intra-modal and inter-modal instances to generate intra-modal and inter-modal enriched prototypes ($\bm{p}_{en}^v$, $\bm{p}_{en}^t$, $\bm{p}_{eo}^v$ and $\bm{p}_{eo}^t$), respectively. These prototypes are aggregated through Adaptive Prototypical Aggregation (APA) to generate the final prototypes ($\bm{p}^v$ and $\bm{p}^t$), and their rich identity information is diffused to each instance through prototype-to-instance contrastive loss ($\mathcal{L}_{p2v}$, $\mathcal{L}_{p2t}$) to model identity-level matching. We also introduce the MLM module as compensation to model fine-grained matching. During testing, only the visual and textual encoders are used for inference.
  • Figure 3: Effects of four hyper-parameters on CUHK-PEDES, including contextual vector length $K$, the block number $N_a, N_e$, and loss weight $\lambda_1$.
  • Figure 4: Retrieval result comparisons of Baseline and our Propot on CUHK-PEDES. The matched and mismatched person images are marked with green and red rectangles, respectively.
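
The prototype-to-instance contrastive loss mentioned in the Figure 2 caption can be illustrated with a minimal sketch. The exact formulation is not given on this page, so everything below is an assumption made for illustration: prototypes are taken as L2-normalized per-identity feature means (a stand-in for the CLIP-derived initial prototypes), the loss is a generic InfoNCE-style cross-entropy over cosine similarities, and the names `mean_prototypes`, `proto_to_instance_loss`, and `temperature` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_prototypes(features, labels):
    """Assumed stand-in for initial prototypes: the normalized mean feature per identity."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + x for s, x in zip(sums[y], f)]
        counts[y] += 1
    return {y: l2_normalize([s / counts[y] for s in sums[y]]) for y in sums}


def proto_to_instance_loss(prototypes, features, labels, temperature=0.1):
    """InfoNCE-style loss: each instance should score highest against its own identity's prototype."""
    ids = sorted(prototypes)
    total = 0.0
    for f, y in zip(features, labels):
        f = l2_normalize(f)
        # Cosine similarity to every prototype, sharpened by the temperature.
        logits = [sum(a * b for a, b in zip(f, prototypes[i])) / temperature for i in ids]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the instance's own identity as the target class.
        total += log_z - logits[ids.index(y)]
    return total / len(features)


# Toy batch: two identities with two 2-D instance features each.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
prototypes = mean_prototypes(features, labels)
loss_correct = proto_to_instance_loss(prototypes, features, labels)
loss_swapped = proto_to_instance_loss(prototypes, features, [1, 1, 0, 0])
```

On this toy batch the loss is lower when each instance is paired with its own identity's prototype than when the labels are swapped, which is the behavior such a loss is meant to reward. In the framework described in the caption, the prototypes would come from the DPP, IPP, and APA modules rather than from a simple mean.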