Exploring Part-Informed Visual-Language Learning for Person Re-Identification

Yin Lin; Yehansen Chen; Baocai Yin; Jinshui Hu; Bing Yin; Cong Liu; Zengfu Wang

TL;DR

The paper tackles the limitation of global image-text alignment in visual-language ReID by addressing fine-grained part semantics. It introduces Part-Informed Visual-Language Learning ($\pi$-VL), which combines parsing-guided pixel prompts, identity-aware part prompts, a hierarchical fusion-based alignment head, and a parsing-confidence weighted loss to enable dense pixel-level image-text alignment while keeping encoders fixed during prompt tuning. Empirical results on MSMT17 and other benchmarks show competitive performance, including 91.0% Rank-1 and 76.9% mAP on MSMT17, with improvements observed across CNN and ViT backbones and without extra inference cost. Overall, $\pi$-VL broadens the applicability of visual-language pre-training to fine-grained ReID, enabling robust part-level semantic alignment in a plug-and-play, inference-free framework.

Abstract

Recently, visual-language learning (VLL) has shown great potential in enhancing visual-based person re-identification (ReID). Existing VLL-based ReID methods typically focus on image-text feature alignment at the whole-body level while neglecting supervision on fine-grained part features, and thus lack constraints on the semantic consistency of local features. To this end, we propose Part-Informed Visual-Language Learning ($\pi$-VL) to enhance fine-grained visual features with part-informed language supervision for ReID tasks. Specifically, $\pi$-VL introduces a human parsing-guided prompt tuning strategy and a hierarchical visual-language alignment paradigm to ensure within-part feature semantic consistency. The former combines identity labels and human parsing maps to constitute pixel-level text prompts, and the latter fuses multi-scale visual features with a lightweight auxiliary head to perform fine-grained image-text alignment. As a plug-and-play and inference-free solution, our $\pi$-VL achieves performance comparable to or better than state-of-the-art methods on four commonly used ReID benchmarks. Notably, it reports 91.0% Rank-1 and 76.9% mAP on the challenging MSMT17 dataset, without bells and whistles.
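
To make the idea of pixel-level image-text alignment more concrete, the following is a minimal sketch of a parsing-confidence weighted alignment loss of the kind the TL;DR mentions. It is not the paper's formulation: the function name `weighted_pixel_text_loss`, the cosine-similarity logits, the temperature of 0.07, and the normalization by the sum of confidences are all assumptions made for illustration. Plain Python lists are used for readability; a practical implementation would use a tensor library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def weighted_pixel_text_loss(pixel_feats, part_text_embeds, part_labels,
                             confidences, temperature=0.07):
    """Confidence-weighted cross-entropy between pixels and part text embeddings.

    pixel_feats:      N pixel feature vectors (hypothetical dense visual features)
    part_text_embeds: K text embeddings, one per body part
    part_labels:      N part indices taken from a human parsing map
    confidences:      N parsing confidences in [0, 1]
    The weighting and normalization scheme here is an assumption.
    """
    total, weight_sum = 0.0, 0.0
    for feat, label, conf in zip(pixel_feats, part_labels, confidences):
        # Similarity of this pixel to every part-level text embedding.
        logits = [cosine(feat, t) / temperature for t in part_text_embeds]
        # Numerically stable log-sum-exp for the softmax normalizer.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        # Cross-entropy for the parsed part, scaled by parsing confidence.
        total += conf * (log_z - logits[label])
        weight_sum += conf
    return total / max(weight_sum, 1e-12)


# Toy example: two parts, three pixels, the last one with low parsing confidence.
text = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
labels = [0, 1, 1]
conf = [0.95, 0.90, 0.10]
loss = weighted_pixel_text_loss(pixels, text, labels, conf)
```

In this sketch, pixels whose parsing label is unreliable contribute little to the loss, so noisy parsing maps would be less likely to pull pixel features toward the wrong part description.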

Paper Structure (14 sections, 9 equations, 4 figures, 3 tables)

Introduction
Related Work
  Appearance-based Person ReID
  Visual-language Pre-training
Methodology
  Preliminaries: Overview of CLIP-ReID
  The Within-part Semantic Inconsistency Issue
  Part-Informed Prompt Tuning
  Part-Informed Visual-Language ReID
Experiments
  Datasets and Evaluation Protocols
  Comparisons with State-of-the-art Methods
  Ablation Studies
Conclusion

Figures (4)

Figure 1: Comparison of CLIP-ReID (Li et al., 2023) and our part-informed visual-language learning ($\pi$-VL) framework. (a) CLIP-ReID based on global image-text alignment. (b) Our $\pi$-VL based on pixel-level image-text alignment.
Figure 2: Illustration of within-part semantic inconsistency. Colors indicate different body parts, symbols denote human identities, and the red dashed line represents the decision boundary for identity recognition.
Figure 3: The proposed $\pi$-VL framework. To solve the within-part semantic inconsistency issue (Section "The Within-part Semantic Inconsistency Issue"), it first learns identity-specific and part-informed text prompts in a coarse-to-fine manner (Section "Part-Informed Prompt Tuning"). Then it leverages a hierarchical fusion-based alignment strategy (Section "Part-Informed Visual-Language ReID") to perform fine-grained image-text alignment between part-informed text embeddings and multi-scale visual features.
Figure 4: Illustration of the hierarchical image-text alignment strategy. We propose to fuse multi-scale features for image-text alignment.
