Improving Text-based Person Search via Part-level Cross-modal Correspondence
Jicheol Park, Boseung Jeong, Dongwon Kim, Suha Kwak
TL;DR
This work tackles the modality gap in text-based person search by introducing an efficient encoder-decoder that yields coarse-to-fine embeddings, semantically aligned across images and text through shared tokens. A novel commonality-based margin ranking loss $L_{CMR}$ guides the learning of fine-grained body-part details under identity-only supervision, enabling discriminative matching at the part level. The method achieves state-of-the-art performance on three benchmarks (CUHK-PEDES, ICFG-PEDES, RSTPReid), and ablations validate the contributions of the coarse embeddings, the fine embeddings, and the $L_{CMR}$ loss. The approach provides a scalable cross-modal framework that improves robustness to background clutter and captures subtle apparel and attribute details, offering practical benefits for surveillance and search systems. The main limitation is its reliance on cropped person images; integrating person detection and learnable part decomposition is left to future work and may bring further gains.
Abstract
Text-based person search is the task of finding the person images that are most relevant to a natural language text description given as a query. The main challenge of this task is the large modality gap between the target images and the text queries, which makes it difficult to establish correspondence and to distinguish subtle differences between people. To address this challenge, we introduce an efficient encoder-decoder model that extracts coarse-to-fine embedding vectors which are semantically aligned across the two modalities without supervision for the alignment. A second challenge is learning to capture fine-grained information with only person IDs as supervision: similar body parts of different individuals are treated as different because part-level supervision is lacking. To tackle this, we propose a novel ranking loss, dubbed commonality-based margin ranking loss, which quantifies the degree of commonality of each body part and reflects it when learning fine-grained body-part details. As a result, our method achieves the best results on three public benchmarks.
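The abstract does not give the formula for the commonality-based margin ranking loss, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes per-part similarity scores for a matching and a non-matching image-text pair, and a hypothetical commonality score in [0, 1] per part; the function name, the linear margin scaling, and the base margin of 0.2 are all assumptions made for illustration. The intuition shown is that a part which looks alike across different identities gets a smaller margin, so the model is not forced to push genuinely similar parts apart.

```python
def commonality_margin_ranking_loss(sim_pos, sim_neg, commonality, base_margin=0.2):
    """Illustrative part-level margin ranking loss (hypothetical, not the paper's L_CMR).

    sim_pos:     per-part similarity of a matching image-text pair
    sim_neg:     per-part similarity of a non-matching pair
    commonality: per-part score in [0, 1]; 1 means the part is shared
                 between the two identities, 0 means it is distinctive
    """
    if not (len(sim_pos) == len(sim_neg) == len(commonality)):
        raise ValueError("all inputs must have one entry per body part")
    if not sim_pos:
        raise ValueError("at least one body part is required")

    total = 0.0
    for s_pos, s_neg, c in zip(sim_pos, sim_neg, commonality):
        # Assumed scaling: the more common the part, the smaller the margin.
        margin = base_margin * (1.0 - c)
        # Standard hinge-style ranking term for this part.
        total += max(0.0, margin - s_pos + s_neg)
    return total / len(sim_pos)


# Two parts: the first is fully common (margin 0), the second fully distinctive.
loss = commonality_margin_ranking_loss(
    sim_pos=[0.9, 0.5],
    sim_neg=[0.8, 0.6],
    commonality=[1.0, 0.0],
)
```

In this toy call the common part contributes nothing to the loss even though the negative pair scores almost as high as the positive one, while the distinctive part is penalized because the negative pair outranks the positive one.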
