From Attributes to Natural Language: A Survey and Foresight on Text-based Person Re-identification

Fanzhi Jiang, Su Yang, Mark W. Jones, Liumei Zhang

TL;DR

This survey offers a structured analysis of text-based person re-identification, distinguishing attribute-based from natural-language-based approaches and organizing methods along four dimensions: Evaluation, Strategy, Architecture, and Optimization. It catalogs key NL-based datasets (CUHK-PEDES, RSTPReid, ICFG-PEDES) and attribute datasets, outlines core strategies (stripe segmentation, multi-scale fusion, attention, external cues), and reviews architectural families (CNNs, RNNs, autoencoders, GNNs, transformers) together with prevalent loss functions (IDE, verification, contrastive, triplet, quadruplet, adversarial). The paper highlights challenges such as language granularity and real-world data variability, and closes with a foresight section that introduces a diffusion-based, generation-guided open-set Re-ID baseline (TBPGR) aimed at open-world retrieval. Overall, it provides a technical roadmap for advancing text-to-image cross-modal Re-ID and its practical deployment, and it points to large multimodal models and generative approaches as ways to bridge the modality gap and support search by description.

Abstract

Text-based person re-identification (Re-ID) is a challenging topic in the field of complex multimodal analysis; its ultimate aim is to recognize specific pedestrians by scrutinizing attributes or natural language descriptions. Despite the wide range of applicable areas such as security surveillance, video retrieval, person tracking, and social media analytics, there is a notable absence of comprehensive reviews dedicated to summarizing text-based person Re-ID from a technical perspective. To address this gap, we introduce a taxonomy spanning Evaluation, Strategy, Architecture, and Optimization dimensions, providing a comprehensive survey of the text-based person Re-ID task. We start by laying the groundwork for text-based person Re-ID, elucidating fundamental concepts related to attribute-based and natural-language-based identification. We then present a thorough examination of existing benchmark datasets and metrics. Subsequently, we delve into prevalent feature extraction strategies employed in text-based person Re-ID research, followed by a concise summary of common network architectures within the domain. Prevalent loss functions used for model optimization and modality alignment in text-based person Re-ID are also scrutinized. To conclude, we offer a concise summary of our findings, pinpointing challenges in text-based person Re-ID. In response to these challenges, we outline potential avenues for future open-set text-based person Re-ID and present a baseline architecture for text-based pedestrian image generation-guided re-identification (TBPGR).

Paper Structure (36 sections, 10 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Conceptual diagram of text-based person Re-ID. Given a textual description of a target person collected from a street witness, an operator uses the model to retrieve the corresponding person image from a database of images collected from street CCTV.
  • Figure 2: Multi-scale feature fusion: The figure shows the basic operations regarding the fusion of image and text multi-scale features in person re-identification.
  • Figure 3: Principles of the most prominent alignment losses in text-based person re-identification: identity loss, verification loss, contrastive loss, triplet loss, quadruplet loss, and adversarial loss. The vast majority of newly devised alignment methods are variants or combinations of these losses.
  • Figure 4: Text-Based Pedestrian Image Generation Guided Re-ID baseline architecture. The architecture consists of three modules: A) Text-based pedestrian generator, B) Pedestrian refinement module, and C) Pedestrian re-recognition system. The flow is best read starting from module A at the bottom left and proceeding clockwise. Note that modules A and B are not sharply separated; they interact during the pedestrian generation process.
  • Figure 5: Generated text-based pedestrian images without and with fine-tuning, along with their Rank@10 retrieval results. Green numbers at the top of an image mark a positive retrieval and red numbers mark a negative retrieval.
  • ...and 3 more figures
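
The Rank@10 results mentioned in the Figure 5 caption can be read through a minimal sketch of a Rank@K computation for text-to-image retrieval. This is an illustration under stated assumptions, not the paper's evaluation code: the function names `cosine` and `rank_at_k`, the use of cosine similarity, and the toy embeddings are all hypothetical. A query counts as a hit when at least one gallery image among its top K shares the query's identity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_at_k(text_embs, text_ids, image_embs, image_ids, k=10):
    """Fraction of text queries whose top-k gallery images contain the right identity.

    Assumption: text and image embeddings already live in a shared space,
    e.g. produced by some cross-modal model that is not shown here.
    """
    if not text_embs:
        return 0.0
    hits = 0
    for query, query_id in zip(text_embs, text_ids):
        # Sort gallery indices by similarity to the query, most similar first.
        order = sorted(
            range(len(image_embs)),
            key=lambda i: cosine(query, image_embs[i]),
            reverse=True,
        )
        top_ids = [image_ids[i] for i in order[:k]]
        if query_id in top_ids:
            hits += 1
    return hits / len(text_embs)


# Toy data (hypothetical): two text queries for identity 0, three gallery images.
queries = [[1.0, 0.0], [0.0, 1.0]]
query_ids = [0, 0]
gallery = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8]]
gallery_ids = [0, 1, 0]

print(rank_at_k(queries, query_ids, gallery, gallery_ids, k=1))
print(rank_at_k(queries, query_ids, gallery, gallery_ids, k=2))
```

In this toy setup the second query's nearest image belongs to a different identity, so it only counts as a hit once K is large enough to include the second-ranked image.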