Boosting Weak Positives for Text Based Person Search

Akshay Modi, Ashhar Aziz, Nilanjana Chatterjee, A V Subramanyam

TL;DR

Text-based person search (TBPS) suffers from weakly aligned image-text pairs and data noise. The authors propose a boosting mechanism that dynamically emphasizes rank-$k$ weak positives by scaling their loss contribution with a weight $w_b^k(i)$ within a CLIP-based image-text contrastive (ITC) framework. The boosting module extends to other losses, including ID and SDM, and improves the IRRA and RDE baselines, with gains across four pedestrian datasets. The improvements also hold under distractors and in cross-dataset evaluation, which highlights the method's practical value for real-world TBPS systems.

Abstract

Large vision-language models have revolutionized cross-modal object retrieval, but text-based person search (TBPS) remains a challenging task due to limited data and the fine-grained nature of the task. Existing methods primarily focus on aligning image-text pairs in a common representation space, often disregarding the fact that real-world positive image-text pairs vary in their degree of similarity. This leads models to prioritize easy pairs, and in some recent approaches, challenging samples are discarded as noise during training. In this work, we introduce a boosting technique that dynamically identifies and emphasizes these challenging samples during training. Our approach is motivated by classical boosting and dynamically updates the weights of the weak positives, that is, pairs for which the rank-1 match does not share the identity of the query. The weight makes these misranked pairs contribute more to the loss, so the network has to pay more attention to such samples. Our method achieves improved performance across four pedestrian datasets, demonstrating the effectiveness of our proposed module.

Paper Structure

This paper contains 17 sections, 11 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Illustration of how positive and negative samples behave before and after our boosting method. Boosting increases the weight of the positive pair that occurs at rank-$k$ ($k=2$ in this case). After boosting, the negative sample is pushed away from the query while the second most similar positive sample is pulled closer.
  • Figure 2: Images and captions are encoded using image and text encoders. The embeddings are used to calculate a similarity matrix, which interacts with the boosting module to dynamically identify and enhance weak positives. The similarity matrix and boosting weights are then passed to the training objective.
  • Figure 3: Top: Ablation for $k$ and $\exp(\alpha)$. Bottom: Ablation for the number of epochs before which weights are updated.
  • Figure 4: Qualitative results of images retrieved by CLIP and CLIP+B. The green boxes indicate the correct matches. It can be seen that rank-2 correct matches are pushed to rank-1 when retrieved by our boosted model compared to the baseline.
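
The pipeline in the Figure 1 and Figure 2 captions (similarity matrix, boosting module, weighted training objective) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `boost_weights` and `weighted_itc_loss`, the multiplicative update by $\exp(\alpha)$ (suggested only by the Figure 3 ablation), the top-$k$ test for a weak positive, the normalization by the weight sum, and the temperature value are all assumptions made for this example. Only the text-to-image direction is shown.

```python
import numpy as np


def boost_weights(sim, query_ids, gallery_ids, weights, k=2, alpha=0.5):
    """Raise the weight of queries that look like weak positives.

    Assumption for this sketch: a query is a weak positive when its rank-1
    image has a different identity, but an image with the correct identity
    appears within the top-k results.
    sim is a (Q, G) text-to-image similarity matrix.
    """
    order = np.argsort(-sim, axis=1)           # gallery indices, most similar first
    ranked_ids = gallery_ids[order]            # (Q, G) identities in ranked order
    match = ranked_ids == query_ids[:, None]   # True where the identity matches
    rank1_wrong = ~match[:, 0]
    positive_in_top_k = match[:, :k].any(axis=1)
    weak = rank1_wrong & positive_in_top_k
    new_weights = weights.copy()
    new_weights[weak] *= np.exp(alpha)         # assumed AdaBoost-style update
    return new_weights, weak


def weighted_itc_loss(sim, pos_index, weights, temperature=0.02):
    """Text-to-image contrastive loss with one weight per query.

    pos_index[i] is the gallery column holding the positive image of query i.
    Dividing by the weight sum is a choice made here, not taken from the paper.
    """
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_query = -log_prob[np.arange(sim.shape[0]), pos_index]
    return float((weights * per_query).sum() / weights.sum())


# Toy batch: query 1 ranks the wrong image first and its positive second.
sim = np.array([[0.9, 0.1, 0.2],
                [0.8, 0.7, 0.1],
                [0.1, 0.2, 0.9]])
ids = np.array([0, 1, 2])
weights, weak = boost_weights(sim, ids, ids, np.ones(3), k=2, alpha=0.5)
base_loss = weighted_itc_loss(sim, ids, np.ones(3))
boosted_loss = weighted_itc_loss(sim, ids, weights)
```

In the toy batch only the second query is flagged, so its misranked pair takes a larger share of the loss, which is the behavior the Figure 1 caption describes for $k=2$.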