Prototypical Prompting for Text-to-image Person Re-identification

Shuanglin Yan; Jun Liu; Neng Dong; Liyan Zhang; Jinhui Tang

TL;DR

This paper proposes Propot, a prototypical prompting framework that models instance-level and identity-level matching simultaneously for Text-to-Image Person Re-identification (TIReID). It recasts identity-level matching as a prototype learning problem whose goal is to learn identity-enriched prototypes.

Abstract

We study Text-to-Image Person Re-identification (TIReID), which aims to retrieve, from a pool of candidate images, the images of the person described by a text sentence. Benefiting from Vision-Language Pre-training models such as CLIP (Contrastive Language-Image Pretraining), TIReID techniques have recently achieved remarkable progress. However, most existing methods focus only on instance-level matching and ignore identity-level matching, which involves associating multiple images and texts belonging to the same person. We propose a novel prototypical prompting framework (Propot) designed to simultaneously model instance-level and identity-level matching for TIReID. Propot transforms the identity-level matching problem into a prototype learning problem, aiming to learn identity-enriched prototypes. Specifically, Propot works by 'initialize, adapt, enrich, then aggregate'. We first use CLIP to generate high-quality initial prototypes. Then, we propose a domain-conditional prototypical prompting (DPP) module to adapt the prototypes to the TIReID task using task-related information. Further, we propose an instance-conditional prototypical prompting (IPP) module to update prototypes conditioned on intra-modal and inter-modal instances to ensure prototype diversity. Finally, we design an adaptive prototype aggregation module to aggregate these prototypes, generating the final identity-enriched prototypes. We then diffuse the rich identity information of these prototypes to instances through a prototype-to-instance contrastive loss to facilitate identity-level matching. Extensive experiments conducted on three benchmarks demonstrate the superiority of Propot compared to existing TIReID methods.
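To make the idea of identity-level matching concrete, the following is a minimal sketch of a contrastive loss between instances and identity prototypes. It is an illustration under stated assumptions, not the paper's formulation: the abstract does not give the loss definition, so the sketch assumes prototypes are initialized as per-identity feature means and uses an InfoNCE-style cross-entropy over cosine similarities. The names `mean_prototypes`, `proto_to_instance_loss`, and the `temperature` value are hypothetical, and plain Python lists stand in for CLIP features.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_prototypes(features, labels):
    """ASSUMPTION: one prototype per identity, taken as the mean of its features."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + x for s, x in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def proto_to_instance_loss(prototypes, features, labels, temperature=0.07):
    """ASSUMPTION: InfoNCE-style loss pulling each instance toward the
    prototype of its own identity and away from the other prototypes."""
    ids = sorted(prototypes)
    protos = [l2_normalize(prototypes[y]) for y in ids]
    total = 0.0
    for f, y in zip(features, labels):
        f = l2_normalize(f)
        logits = [dot(f, p) / temperature for p in protos]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[ids.index(y)]
    return total / len(features)


# Toy features for two identities (stand-ins for image or text embeddings).
features = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
prototypes = mean_prototypes(features, labels)
loss = proto_to_instance_loss(prototypes, features, labels)
print(f"prototype-to-instance loss: {loss:.6f}")
```

In this toy setup the loss is small because every instance already lies close to its own identity's prototype; assigning the instances to the wrong identities makes it grow, which is the signal such a loss would use to tighten same-identity clusters.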

Paper Structure (17 sections, 13 equations, 4 figures, 5 tables)

Introduction
Related Work
Text-to-Image Person Re-identification
Vision-Language Pre-Training
Prompt Learning
The Propot Framework
Feature Extraction
Initial Prototype Generation
Domain-conditional Prototypical Prompting
Instance-conditional Prototypical Prompting
Adaptive Prototype Aggregation
Training and Inference
Experiments
Experiment Settings
Comparisons with State-of-the-art Models
...and 2 more sections

Figures (4)

Figure 1: The motivation of Propot. (a) Instances under the same identity show significant differences. (b) Current TIReID methods focus only on instance-level matching and ignore identity-level matching. (c) Propot uses a prototypical prompting framework to create identity-enriched prototypes and diffuse their rich identity information to instances for modeling identity-level matching.
Figure 2: Overview of Propot. It includes instance-level matching and identity-enriched prototype learning. For instance-level matching, each image and its annotated text are directly aligned through the SDM loss (Baseline). For prototype learning, we first utilize pre-trained CLIP to generate initial prototypes ($\bm{pt}^v$ and $\bm{pt}^t$). We then adapt the initial prototypes to TIReID through the DPP module, resulting in task-adapted prototypes ($\bm{p}_a^v$ and $\bm{p}_a^t$). The IPP module updates the prototypes conditioned on a batch of intra-modal and inter-modal instances, generating intra-modal and inter-modal enriched prototypes ($\bm{p}_{en}^v$, $\bm{p}_{en}^t$, $\bm{p}_{eo}^v$ and $\bm{p}_{eo}^t$). The multiple prototypes are aggregated using Adaptive Prototypical Aggregation (APA) to generate the final prototypes ($\bm{p}^v$ and $\bm{p}^t$). Their rich identity information is then diffused to each instance using the prototype-to-instance contrastive losses ($\mathcal{L}_{p2v}$, $\mathcal{L}_{p2t}$) to model identity-level matching. Moreover, we introduce the MLM module to enhance fine-grained matching. During testing, only the visual and textual encoders are used for inference.
Figure 3: Effects of four hyper-parameters on CUHK-PEDES: the contextual vector length $K$, the block numbers $N_a$ and $N_e$, and the loss weight $\lambda_1$.
Figure 4: Retrieval result comparisons of Baseline (the 1st row) and Propot (the 2nd row) on CUHK-PEDES. The matched and mismatched person images are marked with green and red rectangles, respectively.
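The aggregation step named in the Figure 2 caption, where several prototypes of one identity are merged into a single final prototype, can be illustrated with a small sketch. The captions do not say how the aggregation weights are computed, so this example assumes the simplest adaptive scheme: a softmax over one score per candidate prototype, followed by a weighted sum. The function `aggregate_prototypes` and the `scores` argument are hypothetical; in a trained model such scores would presumably be learned or predicted rather than set by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_prototypes(candidates, scores):
    """ASSUMPTION: the final prototype is a softmax-weighted sum of the
    candidate prototypes (e.g. initial, task-adapted, and enriched ones)."""
    weights = softmax(scores)
    dim = len(candidates[0])
    return [sum(w * c[i] for w, c in zip(weights, candidates)) for i in range(dim)]


# Three toy candidate prototypes for one identity.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores reduce to a plain average of the candidates.
uniform = aggregate_prototypes(candidates, [0.0, 0.0, 0.0])

# A dominant score makes the result follow that candidate almost exactly.
peaked = aggregate_prototypes(candidates, [50.0, 0.0, 0.0])
```

With equal scores the result is the mean of the candidates, and as one score grows the result approaches the corresponding candidate, so a single set of scores lets the aggregation interpolate between averaging and selection.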
