Improving Text-based Person Search via Part-level Cross-modal Correspondence

Jicheol Park, Boseung Jeong, Dongwon Kim, Suha Kwak

TL;DR

This work tackles the modality gap in text-based person search with an efficient encoder-decoder that uses shared tokens to produce coarse-to-fine embeddings that are semantically aligned across images and text. A commonality-based margin ranking loss, $\ell_\textrm{CMR}$, guides the learning of fine-grained body-part details using only identity supervision, enabling discriminative matching at the part level. The method achieves state-of-the-art performance on three benchmarks (CUHK-PEDES, ICFG-PEDES, RSTPReid), and ablations validate the contributions of the coarse embeddings, the fine embeddings, and the $\ell_\textrm{CMR}$ loss. The approach provides a scalable cross-modal framework that improves robustness to background clutter and captures subtle apparel and attribute details, offering practical benefits for surveillance and search systems. Its main limitation is the reliance on cropped person images; integrating person detection and learnable part decomposition is a potential direction for future work.

Abstract

Text-based person search is the task of finding person images that are the most relevant to a natural language text description given as a query. The main challenge of this task is the large gap between the target images and text queries, which makes it difficult to establish correspondence and distinguish subtle differences across people. To address this challenge, we introduce an efficient encoder-decoder model that extracts coarse-to-fine embedding vectors which are semantically aligned across the two modalities without supervision for the alignment. Another challenge is learning to capture fine-grained information with only person IDs as supervision, where similar body parts of different individuals are considered different due to the lack of part-level supervision. To tackle this, we propose a novel ranking loss, dubbed commonality-based margin ranking loss, which quantifies the degree of commonality of each body part and reflects it during the learning of fine-grained body part details. As a consequence, our method achieves the best results on three public benchmarks.
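The idea of a ranking loss whose margin depends on how common a body part is can be illustrated with a small NumPy sketch. This is not the paper's formulation: the function name `part_margin_ranking_loss`, the `base_margin` value, and the linear rule `margin = base_margin * (1 - commonality)` are all assumptions made for illustration, and the commonality scores are simply taken as inputs because this summary does not specify how they are computed.

```python
import numpy as np

def part_margin_ranking_loss(pos_sim, neg_sim, commonality, base_margin=0.2):
    """Hinge-style ranking loss with one margin per body part.

    pos_sim:     (K,) similarity between matching image/text part embeddings
    neg_sim:     (K,) similarity between non-matching part embeddings
    commonality: (K,) scores in [0, 1]; 1 means the part looks alike across
                 identities (hypothetical input, not derived here)
    """
    pos_sim = np.asarray(pos_sim, dtype=float)
    neg_sim = np.asarray(neg_sim, dtype=float)
    commonality = np.asarray(commonality, dtype=float)

    # Assumed rule: the more common a part is, the smaller the margin that
    # the negative pair is required to respect.
    margin = base_margin * (1.0 - commonality)

    # Standard hinge: penalize only when the negative pair is not at least
    # `margin` less similar than the positive pair.
    per_part = np.maximum(0.0, margin - pos_sim + neg_sim)
    return per_part.mean()

# Two parts: the first is distinctive, the second is fully common.
loss_cmr_like = part_margin_ranking_loss(
    pos_sim=[0.8, 0.6], neg_sim=[0.7, 0.55], commonality=[0.0, 1.0]
)
# Same pairs with a fixed margin for every part (commonality ignored).
loss_fixed = part_margin_ranking_loss(
    pos_sim=[0.8, 0.6], neg_sim=[0.7, 0.55], commonality=[0.0, 0.0]
)
```

In this toy setting the fully common part contributes no penalty, whereas a fixed margin would push two visually similar parts of different people apart, which is the problem the abstract attributes to identity-only supervision.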

Paper Structure (17 sections, 14 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: The overall pipeline of our method. Given an image and a text description of a person as input, the corresponding backbone models (ResNet50 and BERT) extract visual and textual features. Global embeddings of the image and the text description ($\mathbf{g}_v$ and $\mathbf{g}_t$) are obtained by global max pooling over the visual and textual features. Coarse embeddings of the image and text modalities ($\mathbf{c}_v$ and $\mathbf{c}_t$) are produced by an encoder-decoder taking the visual and textual features, and a set of learnable tokens as input. Fine embeddings of the image ($\mathbf{f}_v$) are obtained by horizontally dividing the feature map. Meanwhile, the fine embeddings of the text description ($\mathbf{f}_t$) are extracted by the decoder with another set of tokens, namely text tokens. The global and coarse embeddings of both image and text description are aligned by conventional identity classification loss ($\ell_\textrm{ID}$) and ranking loss ($\ell_\textrm{R}$). On the other hand, the fine embeddings are aligned by identity classification loss and commonality-based margin ranking loss ($\ell_\textrm{CMR}$).
  • Figure 2: A conceptual illustration of the Commonality-based Margin Ranking (CMR) loss function in Eq. \ref{eq:cmr loss}. The square and circle symbols denote the fine embeddings of image and text modalities, respectively.
  • Figure 3: Qualitative results of our method on the CUHK-PEDES dataset. Query texts and the retrieval results of our method for successful cases are presented, while the failure case of our method presents a query text, its ground truth, and the top 3 retrieval results. The true and false matches are colored green and red, respectively.
  • Figure 4: Visualization of a cross-attention map of (a) visual and (b) textual features from the decoder.
  • Figure A: Qualitative results of our method on the CUHK-PEDES dataset. Query texts and the top-5 retrieval results of our method are presented. The true and false matches are colored green and red, respectively.
  • ...and 2 more figures
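
The image-side pooling steps described in the Figure 1 caption (global max pooling for $\mathbf{g}_v$ and a horizontal division of the feature map for $\mathbf{f}_v$) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the feature-map shape, the number of parts, the helper names, and the use of max pooling inside each stripe are assumptions, since the caption only states that the map is divided horizontally.

```python
import numpy as np

def global_max_pool(feature_map):
    """Reduce a (C, H, W) feature map to a (C,) vector by taking the
    maximum over all spatial positions."""
    return feature_map.max(axis=(1, 2))

def horizontal_part_embeddings(feature_map, num_parts):
    """Split a (C, H, W) feature map into `num_parts` horizontal stripes
    along the height axis and max-pool each stripe, giving (num_parts, C).
    Pooling each stripe with max is an assumption of this sketch."""
    stripes = np.array_split(feature_map, num_parts, axis=1)
    return np.stack([stripe.max(axis=(1, 2)) for stripe in stripes])

rng = np.random.default_rng(0)
# Assumed backbone output: 2048 channels on a 24 x 8 spatial grid.
feat = rng.standard_normal((2048, 24, 8))

g_v = global_max_pool(feat)                           # global embedding
f_v = horizontal_part_embeddings(feat, num_parts=6)   # fine (part) embeddings
```

Because the stripes partition the feature map, the element-wise maximum over the part embeddings equals the global embedding, which is a quick sanity check on the split.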