Swap Path Network for Robust Person Search Pre-training

Lucas Jaffe, Avideh Zakhor

TL;DR

Swap Path Network (SPNet) introduces end-to-end pre-training for person search by unifying query-centric (QC) and object-centric (OC) training within a single architecture. QC pre-training on full scenes with weakly-labeled bounding boxes yields more transferable features than traditional backbone pre-training, and OC fine-tuning preserves efficient inference. Empirical results on CUHK-SYSU and PRW demonstrate state-of-the-art performance, with SPNet-L achieving 96.4% mAP on CUHK-SYSU and 61.2% mAP on PRW, and QC pre-training providing consistent gains over OC and backbone-only baselines. The method is robust to label noise and offers improved pre-training efficiency, highlighting practical benefits for scalable, end-to-end person search pipelines.

Abstract

In person search, we detect and rank matches to a query person image within a set of gallery scenes. Most person search models make use of a feature extraction backbone, followed by separate heads for detection and re-identification. While pre-training methods for vision backbones are well-established, pre-training additional modules for the person search task has not been previously examined. In this work, we present the first framework for end-to-end person search pre-training. Our framework splits person search into object-centric and query-centric methodologies, and we show that the query-centric framing is robust to label noise, and trainable using only weakly-labeled person bounding boxes. Further, we provide a novel model dubbed Swap Path Net (SPNet) which implements both query-centric and object-centric training objectives, and can swap between the two while using the same weights. Using SPNet, we show that query-centric pre-training, followed by object-centric fine-tuning, achieves state-of-the-art results on the standard PRW and CUHK-SYSU person search benchmarks, with 96.4% mAP on CUHK-SYSU and 61.2% mAP on PRW. In addition, we show that our method is more effective, efficient, and robust for person search pre-training than recent backbone-only pre-training alternatives.

Paper Structure

This paper contains 22 sections, 4 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Examples of label noise and annotation challenges in real annotated scenes, with white boxes used for annotations.
  • Figure 2: The standard person search pre-training approach (top) pre-trains only the backbone weights, using either ImageNet-1k classification or person image crops from, e.g., LUPerson. Our approach (bottom) initializes all model weights using full person scenes with multiple annotated persons. (LUPerson images from paper fu_unsupervised_2021)
  • Figure 3: Full SPNet architecture is shown on the left, with details about the Cascade Loop shown on the right. The subscripts $q, d, a$ stand for "query", "detection", and "anchor" respectively. The superscript "reid" means the embedding or loss is used for re-id, and the superscript "det" means the embedding is used for detection.
  • Figure 4: Query-centric (a) and object-centric (b) pathways of the SP Block. Note that the query-centric pathway takes as input a query embedding $x_q$ extracted from a person image, while the object-centric pathway predicts the query embedding $\hat{x}_q$ directly from a matching anchor embedding $x_a$ using the Bridge Layer $g_\phi$.
  • Figure 5: Visual comparison of object-centric (OC) and query-centric (QC) detection tasks between two augmentations of the same base image (Query, Gallery). One person box is annotated (ground truth $b_q$, $b_g$), while the other is not (missing annotation). Note that anchor box 1, $b_a^{(1)}$, overlaps ground truth box $b_g$, while anchor box 2, $b_a^{(2)}$, does not. Anchor embeddings $x_a$ are used to compute box offsets $\hat{r}$ and anchor probabilities $p$ either directly (OC) or relatively using $x_o = x_q - x_a$ (QC). We do not compute box offsets $\hat{r}$ for $b_a^{(2)}$ because it does not match any ground truth.
  • ...and 3 more figures
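
The captions for Figures 4 and 5 describe two pathways that share weights: the query-centric path scores anchors relative to a query embedding via $x_o = x_q - x_a$, while the object-centric path predicts $\hat{x}_q$ from the anchor embedding $x_a$ with the Bridge Layer $g_\phi$. The following is a minimal NumPy sketch of that idea, not the authors' implementation. Several details are assumptions made purely for illustration: the Bridge Layer is taken to be a single linear map, the detection heads are taken to be linear, the object-centric path is assumed to reuse the same offset form with $\hat{x}_q$ in place of $x_q$, and all names (`W_bridge`, `w_cls`, `W_box`, `heads`, `query_centric`, `object_centric`) and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8          # embedding dimension (illustrative choice)
N_ANCHORS = 5  # number of anchor boxes (illustrative choice)

# Hypothetical stand-ins for learned parameters; randomly initialized here.
W_bridge = rng.normal(size=(D, D)) * 0.1  # Bridge Layer g_phi, ASSUMED linear
w_cls = rng.normal(size=D)                # anchor classification head (assumed linear)
W_box = rng.normal(size=(D, 4))           # box offset regression head (assumed linear)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def heads(x_o):
    """Shared heads: map offset embeddings x_o to probabilities p and box offsets r_hat."""
    p = sigmoid(x_o @ w_cls)  # shape (N,)
    r_hat = x_o @ W_box       # shape (N, 4)
    return p, r_hat


def query_centric(x_q, x_a):
    """QC path: score anchors relative to a given query embedding, x_o = x_q - x_a."""
    x_o = x_q - x_a  # x_q broadcasts over the anchor axis when it has shape (D,)
    return heads(x_o)


def object_centric(x_a):
    """OC path: predict the query embedding from each anchor, then reuse the same heads.

    ASSUMPTION: the OC path uses x_o = g_phi(x_a) - x_a, so its outputs depend on x_a only.
    """
    x_q_hat = x_a @ W_bridge  # predicted query embedding per anchor
    return heads(x_q_hat - x_a)


x_a = rng.normal(size=(N_ANCHORS, D))  # anchor embeddings from a gallery scene
x_q = rng.normal(size=D)               # query embedding from a person image

p_qc, r_qc = query_centric(x_q, x_a)
p_oc, r_oc = object_centric(x_a)
print(p_qc.shape, r_qc.shape, p_oc.shape, r_oc.shape)
```

Under these assumptions, both paths pass through the same `heads` parameters, which is what would let a model pre-trained in the query-centric mode be fine-tuned in the object-centric mode without reinitializing weights. Feeding the Bridge Layer's prediction into `query_centric` reproduces the `object_centric` outputs exactly, which makes the "swap" between paths explicit in this sketch.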