Swap Path Network for Robust Person Search Pre-training
Lucas Jaffe, Avideh Zakhor
TL;DR
Swap Path Network (SPNet) introduces end-to-end pre-training for person search by unifying query-centric (QC) and object-centric (OC) training within a single architecture. QC pre-training on full scenes with weakly-labeled bounding boxes yields more transferable features than traditional backbone pre-training, and OC fine-tuning preserves efficient inference. On the CUHK-SYSU and PRW benchmarks, SPNet-L achieves state-of-the-art results of 96.4% mAP and 61.2% mAP, respectively, and QC pre-training provides consistent gains over OC and backbone-only baselines. The method is also robust to label noise and more efficient to pre-train than backbone-only alternatives.
Abstract
In person search, we detect and rank matches to a query person image within a set of gallery scenes. Most person search models make use of a feature extraction backbone, followed by separate heads for detection and re-identification. While pre-training methods for vision backbones are well-established, pre-training additional modules for the person search task has not been previously examined. In this work, we present the first framework for end-to-end person search pre-training. Our framework splits person search into object-centric and query-centric methodologies, and we show that the query-centric framing is robust to label noise, and trainable using only weakly-labeled person bounding boxes. Further, we provide a novel model dubbed Swap Path Network (SPNet) which implements both query-centric and object-centric training objectives, and can swap between the two while using the same weights. Using SPNet, we show that query-centric pre-training, followed by object-centric fine-tuning, achieves state-of-the-art results on the standard PRW and CUHK-SYSU person search benchmarks, with 96.4% mAP on CUHK-SYSU and 61.2% mAP on PRW. In addition, we show that our method is more effective, efficient, and robust for person search pre-training than recent backbone-only pre-training alternatives.
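For readers new to the task, the retrieval step and the mAP metric quoted above can be illustrated with a minimal sketch. This is not SPNet itself: it assumes that every detected gallery box already has a re-identification embedding, that matches are scored by cosine similarity, and that the tuple layout and the names `rank_gallery` and `average_precision` are hypothetical. Benchmark protocols also add details not modeled here, such as matching detected boxes to ground truth by overlap.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def rank_gallery(query_emb, detections):
    """Sort gallery detections by similarity to the query, best first.

    detections: list of (scene_id, box, embedding) tuples (assumed layout).
    """
    scored = [(cosine(query_emb, emb), scene_id, box)
              for scene_id, box, emb in detections]
    return sorted(scored, key=lambda t: t[0], reverse=True)

def average_precision(ranked_is_match, num_positives):
    """AP for one query: sum of precision at each rank holding a true match,
    divided by the number of ground-truth positives."""
    hits, total = 0, 0.0
    for rank, is_match in enumerate(ranked_is_match, start=1):
        if is_match:
            hits += 1
            total += hits / rank
    return total / num_positives if num_positives else 0.0

# Toy data: one query embedding and three detected boxes in three scenes.
query = [1.0, 0.0]
detections = [
    ("scene_a", (10, 20, 50, 120), [0.9, 0.1]),
    ("scene_b", (30, 40, 70, 150), [0.0, 1.0]),
    ("scene_c", (5, 15, 45, 110), [0.7, 0.7]),
]
ranked = rank_gallery(query, detections)
ranked_scenes = [scene_id for _, scene_id, _ in ranked]

# Suppose scene_a and scene_c contain the query person (assumed labels).
is_match = [scene_id in {"scene_a", "scene_c"} for scene_id in ranked_scenes]
ap = average_precision(is_match, num_positives=2)
```

Averaging `ap` over all queries gives mAP. In this toy case the two true matches are ranked first and second, so the AP is 1.0; a false match ranked between them would lower it.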
