From Attributes to Natural Language: A Survey and Foresight on Text-based Person Re-identification
Fanzhi Jiang, Su Yang, Mark W. Jones, Liumei Zhang
TL;DR
This survey offers a structured, cross-sectional analysis of text-based person re-identification, distinguishing attribute-based from natural-language-based approaches and organizing methods along four dimensions: Evaluation, Strategy, Architecture, and Optimization. It catalogs key NL-based datasets (CUHK-PEDES, RSTPReid, ICFG-PEDES) and attribute datasets, outlines core strategies (stripe segmentation, multi-scale fusion, attention, external cues), and reviews architectural families (CNNs, RNNs, autoencoders, GNNs, transformers) together with prevalent loss functions (IDE, verification, contrastive, triplet, quadruplet, adversarial). The paper highlights challenges such as language granularity and real-world data variability, and closes with a foresight section that introduces a diffusion-based, generation-guided open-set Re-ID baseline (TBPGR) to address open-world retrieval. Overall, it provides a comprehensive technical roadmap for advancing text-to-image cross-modal Re-ID and its practical deployment. The findings underscore the potential of large multimodal models and generative approaches to bridge modality gaps and enable robust, scalable search by description.
Abstract
Text-based person re-identification (Re-ID) is a challenging topic in multimodal analysis. Its ultimate aim is to recognize specific pedestrians from attribute or natural language descriptions. Despite its wide range of applications, such as security surveillance, video retrieval, person tracking, and social media analytics, there is a notable absence of comprehensive reviews that summarize text-based person Re-ID from a technical perspective. To address this gap, we introduce a taxonomy spanning the Evaluation, Strategy, Architecture, and Optimization dimensions and use it to provide a comprehensive survey of the text-based person Re-ID task. We start by laying the groundwork for text-based person Re-ID, elucidating fundamental concepts related to attribute-based and natural language-based identification. We then present a thorough examination of existing benchmark datasets and metrics. Next, we examine the prevalent feature extraction strategies employed in text-based person Re-ID research, followed by a concise summary of common network architectures within the domain. We also review the prevalent loss functions used for model optimization and modality alignment in text-based person Re-ID. To conclude, we summarize our findings and pinpoint the open challenges in text-based person Re-ID. In response to these challenges, we outline potential avenues for future open-set text-based person Re-ID and present a baseline architecture for text-based pedestrian image generation-guided re-identification (TBPGR).
