Enhancing Visual Representation for Text-based Person Searching

Wei Shen, Ming Fang, Yuxia Wang, Jiafeng Xiao, Diping Li, Huangqun Chen, Ling Xu, Weifeng Zhang

TL;DR

This work tackles text-based person search by enabling strong cross-modal alignment and resolving identity confusion through a CLIP-based backbone enhanced by two auxiliary tasks. The Text Guided Masked Image Modeling (TG-MIM) enriches local visual understanding using cross-modal guidance, while Identity Supervised Global Visual Feature Calibration (IS-GVFC) enforces identity-aware global features. Together, they allow effective transfer of multimodal knowledge from CLIP to the target task with only global alignment at inference, achieving state-of-the-art results on three benchmarks and demonstrating substantial gains over baselines. The approach offers a scalable, efficient framework for robust text-to-image person retrieval with practical impact in surveillance and security applications, while acknowledging vague-query limitations and suggesting future interactive refinement.

Abstract

Text-based person search aims to retrieve the matching pedestrians from a large-scale image database according to a text description. The core difficulty of this task is how to extract effective details from pedestrian images and texts and achieve cross-modal alignment in a common latent space. Prior works adopt image and text encoders pre-trained on unimodal data to extract global and local features from images and texts respectively, and then perform global-local alignment explicitly. However, these approaches still lack the ability to understand visual details, and their retrieval accuracy is still limited by identity confusion. To alleviate these problems, we rethink the importance of visual features for text-based person search and propose VFE-TPS, a Visual Feature Enhanced Text-based Person Search model. It introduces the pre-trained multimodal backbone CLIP to learn basic multimodal features and constructs a Text Guided Masked Image Modeling task to enhance the model's ability to learn local visual details without explicit annotation. In addition, we design an Identity Supervised Global Visual Feature Calibration task to guide the model to learn identity-aware global visual features. The key finding of our study is that, with the help of the proposed auxiliary tasks, the knowledge embedded in the pre-trained CLIP model can be successfully adapted to the text-based person search task, and the model's visual understanding ability is significantly enhanced. Experimental results on three benchmarks demonstrate that our model outperforms existing approaches, improving Rank-1 accuracy by a notable margin of about $1\%\sim9\%$. Our code can be found at https://github.com/zhangweifeng1218/VFE_TPS.
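To make the retrieval setting and the Rank-1 metric concrete, the following is a minimal sketch of global-alignment retrieval: each text query and each gallery image is represented by one global feature vector, the gallery is ranked by cosine similarity, and Rank-1 accuracy is the fraction of queries whose top-ranked image has the correct identity. The function names, the two-dimensional toy features, and the identity labels are illustrative assumptions, not taken from the VFE-TPS code base; in practice the features would come from the trained image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_gallery(text_feat, gallery_feats):
    """Return gallery indices sorted from most to least similar to the query."""
    sims = [cosine_similarity(text_feat, g) for g in gallery_feats]
    return sorted(range(len(gallery_feats)), key=lambda i: sims[i], reverse=True)


def rank1_accuracy(text_feats, text_ids, gallery_feats, gallery_ids):
    """Fraction of text queries whose top-ranked image has the same identity."""
    hits = 0
    for feat, pid in zip(text_feats, text_ids):
        top = rank_gallery(feat, gallery_feats)[0]
        hits += int(gallery_ids[top] == pid)
    return hits / len(text_feats)


# Toy data (hypothetical): three gallery images and three text queries.
gallery_feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
gallery_ids = [0, 1, 2]
text_feats = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
text_ids = [0, 1, 2]

# The third query is closest to the image of identity 0, not 2, so it is a miss.
accuracy = rank1_accuracy(text_feats, text_ids, gallery_feats, gallery_ids)
print(f"Rank-1 accuracy: {accuracy:.3f}")
```

In this toy example the third query lands nearest to an image of a different identity, which is the kind of identity confusion the abstract describes; two of the three queries are matched correctly.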

Paper Structure (22 sections, 10 equations, 12 figures, 3 tables, 2 algorithms)

This paper contains 22 sections, 10 equations, 12 figures, 3 tables, 2 algorithms.

Figures (12)

  • Figure 1: (a) Person identity confusion: pedestrians with the same ID have lower similarity than pedestrians with different IDs. The bar chart below each image represents its visual feature. (b) Existing global-local alignment approaches, which need to explicitly conduct global and local alignment between image and text. (c) Our visual feature enhancement aided matching paradigm, which introduces auxiliary tasks to enhance visual features so that only global alignment is needed. The dashed arrows and boxes are only active during training.
  • Figure 2: The overall framework of our Visual Feature Enhanced Text-based Person Search (VFE-TPS) model. The model consists of a basic feature extraction module composed of an image encoder and a text encoder, a visual feature enhancement module (including Text Guided Masked Image Modeling (TG-MIM) shown in the blue dashed box and Identity Supervised Global Visual Feature Calibration (IS-GVFC) shown in the purple dashed box), and a cross-modal global alignment module. Dashed arrows and dashed boxes indicate that these paths or modules are only active during the training stage.
  • Figure 3: Illustration of our TG-MIM and the popular SimMIM. (a) SimMIM predicts raw pixel values for masked patches based on image context. (b) Our proposed TG-MIM first conducts cross-modal interaction via a multi-head cross-attention layer, and then predicts raw pixel values for masked patches based on text and image context.
  • Figure 4: Illustration of our Identity Supervised Global Visual Feature Calibration (IS-GVFC). The number on each arrow denotes the similarity between images. IS-GVFC reduces the difference between global visual features of pedestrian images with the same identity, while enlarging the distance between pedestrian images with different identities.
  • Figure 5: Overall comparison between our model and SOTAs on CUHK-PEDES.
  • ...and 7 more figures