Semi-supervised Text-based Person Search

Daming Gao, Yang Bai, Min Cao, Hao Dou, Mang Ye, Min Zhang

TL;DR

This paper tackles semi-supervised text-based person search (TBPS) with a generation-then-retrieval pipeline: it uses the limited image-text annotations to generate pseudo-texts for unlabeled images, then trains a retrieval model on the expanded dataset. To handle the noise that pseudo-texts inevitably introduce, it proposes a noise-robust retrieval framework combining Hybrid Patch-Channel Masking (PC-Mask) and Noise-Guided Progressive Training (NP-Train); the latter relies on a Noise Measurer and a linear training scheduler to bring data into training progressively according to its estimated semantic noise. Experiments on four TBPS benchmarks (CUHK-PEDES, ICFG-PEDES, RSTPReid, UFine6926) show consistent gains from Basic-GTR over baselines and further improvements from PC-Mask and NP-Train, with strong performance under 1–20% labeled data and favorable cost-performance trade-offs. The work makes TBPS more practical by reducing annotation demands and offering a robust learning paradigm for noisy cross-modal supervision, with potential extensions toward better generation quality and more sophisticated curriculum strategies.
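
The idea of a linear training scheduler driven by per-sample noise estimates can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `start_fraction` parameter, and the choice to admit the least-noisy samples first are all assumptions made for illustration, and the noise scores are taken as given (standing in for the output of a noise measurer).

```python
import math


def linear_keep_fraction(epoch, total_epochs, start_fraction=0.5):
    """Fraction of pseudo-labeled samples admitted at a given epoch.

    Assumed schedule shape: grows linearly from start_fraction at
    epoch 0 to 1.0 at the final epoch.
    """
    if total_epochs <= 1:
        return 1.0
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start_fraction + (1.0 - start_fraction) * progress


def select_samples(noise_scores, epoch, total_epochs, start_fraction=0.5):
    """Return indices of the samples admitted at this epoch.

    Samples are ranked by estimated noise (lower is cleaner) and the
    cleanest fraction is kept; this ordering is an assumption.
    """
    order = sorted(range(len(noise_scores)), key=lambda i: noise_scores[i])
    fraction = linear_keep_fraction(epoch, total_epochs, start_fraction)
    keep = max(1, math.ceil(fraction * len(order)))
    return order[:keep]


# Hypothetical noise scores for four pseudo-texts.
noise = [0.9, 0.1, 0.5, 0.3]
first_epoch = select_samples(noise, epoch=0, total_epochs=5)
last_epoch = select_samples(noise, epoch=4, total_epochs=5)
```

With these hypothetical scores, the first epoch trains on the two cleanest samples only and the last epoch trains on all four.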

Abstract

Text-based person search (TBPS) aims to retrieve images of a specific person from a large image gallery based on a natural language description. Existing methods rely on massive annotated image-text data to achieve satisfactory performance in fully-supervised learning. This poses a significant challenge in practice: acquiring person images from surveillance videos is relatively easy, while obtaining annotated texts is difficult. The paper undertakes a pioneering initiative to explore TBPS under the semi-supervised setting, where only a limited number of person images are annotated with textual descriptions while the majority of images lack annotations. We present a two-stage basic solution based on generation-then-retrieval for semi-supervised TBPS. The generation stage enriches the annotated data by applying an image captioning model to generate pseudo-texts for unannotated images. The retrieval stage then performs fully-supervised retrieval learning on the augmented data. Because the noise in the pseudo-texts interferes with retrieval learning, we propose a noise-robust retrieval framework that enhances the ability of the retrieval model to handle noisy data. The framework integrates two key strategies: Hybrid Patch-Channel Masking (PC-Mask) to refine the model architecture, and Noise-Guided Progressive Training (NP-Train) to enhance the training process. PC-Mask masks the input data at both the patch level and the channel level to prevent overfitting to noisy supervision. NP-Train introduces a progressive training schedule based on the noise level of pseudo-texts to facilitate noise-robust learning. Extensive experiments on multiple TBPS benchmarks show that the proposed framework achieves promising performance under the semi-supervised setting.
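
To make the patch-level and channel-level masking concrete, here is a minimal sketch under stated assumptions. It treats the input as a list of patch tokens, each a list of channel values, and zeroes out randomly chosen whole patches (rows) and whole channels (columns). The function name `pc_mask`, the two ratio parameters, and masking by zeroing are all illustrative assumptions; the abstract does not specify how PC-Mask selects or replaces masked entries.

```python
import random


def pc_mask(tokens, patch_ratio=0.3, channel_ratio=0.1, rng=None):
    """Zero out random patches (rows) and random channels (columns).

    tokens: list of patches, each a list of channel values.
    Returns a new nested list; the input is left unchanged.
    """
    rng = rng if rng is not None else random.Random()
    num_patches = len(tokens)
    num_channels = len(tokens[0]) if tokens else 0

    # Pick which whole patches and which whole channels to drop.
    masked_patches = set(
        rng.sample(range(num_patches), round(patch_ratio * num_patches))
    )
    masked_channels = set(
        rng.sample(range(num_channels), round(channel_ratio * num_channels))
    )

    return [
        [
            0.0 if (p in masked_patches or c in masked_channels) else value
            for c, value in enumerate(row)
        ]
        for p, row in enumerate(tokens)
    ]


# Toy input: 4 patches with 4 channels each, all ones.
toy_tokens = [[1.0] * 4 for _ in range(4)]
masked = pc_mask(toy_tokens, patch_ratio=0.5, channel_ratio=0.5,
                 rng=random.Random(0))
num_zeros = sum(value == 0.0 for row in masked for value in row)
```

In this toy case two of four patches and two of four channels are masked, so 12 of the 16 entries are zeroed regardless of which indices the random generator picks.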

Paper Structure (38 sections, 13 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Comparison of (a) common fully-supervised TBPS, (b) TBPS w/o parallel image-text data, and (c) the proposed semi-supervised TBPS, together with (d) a performance comparison of methods under these settings (i.e., the method of [bai2023text] for TBPS w/o parallel image-text data and the method of [jiang2023irra] for fully-supervised TBPS) on CUHK-PEDES [li2017person].
  • Figure 2: Visualization of human-annotated texts and pseudo-texts generated by the vision-language model BLIP [li2022blip], both under the zero-shot setting and after fine-tuning on 1% labeled data. Pseudo-texts from zero-shot BLIP tend to be coarse-grained, while those from fine-tuned BLIP contain more fine-grained details but inevitably some noise. The noise is highlighted in red. More examples are shown in the Appendix.
  • Figure 3: Illustration of the proposed generation-then-retrieval solution along with the noise-robust retrieval framework.
  • Figure 4: Study on effectiveness and generalizability of the proposed noise-robust retrieval framework (NRF) on CUHK-PEDES with decreasing ratios of labeled data.
  • Figure 5: Study on the labeled ratio on CUHK-PEDES.
  • ...and 3 more figures