PLIP: Language-Image Pre-training for Person Representation Learning

Jialong Zuo, Jiahao Hong, Feng Zhang, Changqian Yu, Hanyu Zhou, Changxin Gao, Nong Sang, Jingdong Wang

TL;DR

PLIP tackles the gap in language-image pre-training for person representation by introducing three targeted pretext tasks—Text-guided Image Colorization ($L_{tic}$), Image-guided Attributes Prediction ($L_{iap}$), and Identity-based Vision-Language Contrast ($L_{IVLC}$, comprising $L_{v2l}$ and $L_{l2v}$)—and a large synthetic dataset SYNTH-PEDES to learn fine-grained, identity-aware cross-modal representations. SYNTH-PEDES, built from LUPerson-NL and LPW with SPAC-generated attribute-rich captions, provides 312,321 identities, 4,791,711 images, and 12,138,157 descriptions for pre-training PLIP. Empirically, PLIP delivers significant improvements across downstream person-centric tasks, including text-based Re-ID, image-based Re-ID, and person search, with strong zero-shot and domain generalization performance. The work supplies practical resources (code, dataset, weights) and highlights considerations around privacy and responsible deployment in real-world applications.

Abstract

Language-image pre-training is an effective technique for learning powerful representations in general domains. However, when applied directly to person representation learning, these general pre-training methods perform unsatisfactorily. The reason is that they neglect critical person-related characteristics, i.e., fine-grained attributes and identities. To address this issue, we propose a novel language-image pre-training framework for person representation learning, termed PLIP. Specifically, we design three pretext tasks: 1) Text-guided Image Colorization, which aims to establish the correspondence between person-related image regions and fine-grained color-part textual phrases; 2) Image-guided Attributes Prediction, which aims to mine fine-grained attribute information about the person's body in the image; and 3) Identity-based Vision-Language Contrast, which aims to correlate the cross-modal representations at the identity level rather than the instance level. Moreover, to implement our pre-training framework, we construct a large-scale person dataset of image-text pairs, named SYNTH-PEDES, by automatically generating textual annotations. We pre-train PLIP on SYNTH-PEDES and evaluate our models on a range of downstream person-centric tasks. PLIP not only significantly improves existing methods on all these tasks, but also shows strong ability in the zero-shot and domain generalization settings. The code, dataset and weights will be released at https://github.com/Zplusdragon/PLIP
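To make the difference between identity-level and instance-level contrast concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a softmax-based contrastive loss in which every image-text pair sharing an identity label counts as a positive. The function name `identity_contrast_loss`, the temperature `tau=0.07`, and the exact loss form are illustrative assumptions; the abstract states only that representations are correlated at the identity level.

```python
import numpy as np


def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def identity_contrast_loss(img_emb, txt_emb, ids, tau=0.07):
    """Hypothetical identity-level contrastive loss (assumed form).

    img_emb, txt_emb: (N, d) arrays of paired image/text embeddings.
    ids: (N,) integer identity labels; pairs with equal labels are positives.
    Returns (vision-to-language loss, language-to-vision loss).
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T / tau                      # (N, N) scaled cosine similarity
    pos = ids[:, None] == ids[None, :]           # same-identity mask (symmetric)

    def one_direction(logits):
        logp = _log_softmax(logits, axis=1)
        # Log of the total probability mass assigned to same-identity samples.
        masked = np.where(pos, logp, -np.inf)
        m = masked.max(axis=1, keepdims=True)    # finite: the diagonal is always positive
        log_pos = m[:, 0] + np.log(np.exp(masked - m).sum(axis=1))
        return -log_pos.mean()

    return one_direction(sim), one_direction(sim.T)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(4, 8))
    txt = rng.normal(size=(4, 8))
    # Two identities, two image-text pairs each.
    print(identity_contrast_loss(img, txt, np.array([0, 0, 1, 1])))
```

With all labels distinct, this sketch reduces to the usual instance-level contrastive loss; giving several samples the same label stops the loss from pushing apart images and captions of the same person.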

Paper Structure

This paper contains 39 sections, 13 equations, 12 figures, 21 tables, 1 algorithm.

Figures (12)

  • Figure 1: Illustration of our framework. Based on the constructed dataset, we pre-train a language-image model with three pretext tasks and transfer the model to downstream person-centric tasks.
  • Figure 2: Overview of our proposed framework incorporating a text-guided image colorization task, an image-guided attributes prediction task and an identity-based vision-language contrast task.
  • Figure 3: Visualization of some examples in our SYNTH-PEDES dataset.
  • Figure 4: The diversity of textual descriptions matters. PC and GC mean prompt caption and generated caption, respectively.
  • Figure 5: Visualization of gray-scale person image colorization results by changing the color words in textual descriptions.
  • ...and 7 more figures