Towards Fully Decoupled End-to-End Person Search

Pengcheng Zhang; Xiao Bai; Jin Zheng; Xin Ning

TL;DR

The paper addresses conflicting objectives between detection and re-id in end-to-end person search by removing shared parameters and introducing a task-incremental network that fully decouples the two sub-tasks. It presents a detection head $f_d$ and a separate re-id head $f_r$, connected via lightweight side-ada and side-fusion bridges, and trains them in two stages with Spatial-noise Augmentation to simulate realistic overlaps. The approach achieves superior results among decoupled models and remains competitive with state-of-the-art end-to-end methods on CUHK-SYSU and PRW, while improving detection performance and preserving end-to-end efficiency. These findings demonstrate that architectural and training-time decoupling can significantly enhance robustness and scalability of person search in unconstrained scenes, with practical impact for deployment in surveillance and analytics systems.

Abstract

End-to-end person search aims to jointly detect and re-identify a target person in raw scene images with a unified model. The detection task unifies all persons while the re-id task discriminates between different identities, resulting in conflicting optimization objectives. Existing works propose to decouple end-to-end person search to alleviate this conflict. Yet these methods remain sub-optimal on one or both of the sub-tasks because their models are only partially decoupled, which limits the overall person search performance. In this paper, we propose to fully decouple person search towards optimal person search. A task-incremental person search network is proposed to incrementally construct an end-to-end model for the detection and re-id sub-tasks, which decouples the model architecture for the two sub-tasks. The proposed task-incremental network allows task-incremental training for the two conflicting tasks. This enables independent learning for the different objectives and thus fully decouples the model for person search. Comprehensive experimental evaluations demonstrate the effectiveness of the proposed fully decoupled models for end-to-end person search.

Paper Structure (11 sections, 4 equations, 4 figures, 9 tables)

Introduction
Related Work
Method
Task-incremental Person Search Network
Task-incremental Model Training
Experiments
Datasets
Implementation Details
Analytical Studies
Comparison with State-of-the-art
Conclusion and Limitations

Figures (4)

Figure 1: Comparison of fully decoupled person search (b) with previous decoupled models (NAE, HOIM) (a). We employ cyan points to indicate the performance of the vanilla end-to-end model (OIM). (a) Left: Partially decoupled model in NAE and HOIM. Right: The green point indicates the performance of the partially decoupled model. By closing the two task-specific feature spaces, the upper bound of performance upon shared parameters $\mathbf{\theta}_s$ is boosted. (b) Left: The proposed fully decoupled person search network. Right: The pink points illustrate the performance of the proposed model. It eliminates the coupled parameters and achieves the optimum for both sub-tasks towards optimal person search.
Figure 2: (a) The proposed fully decoupled person search network which consists of a detection side-net, a re-id side-net, and the modules that bridge them. The model is trained incrementally by person detection and re-id tasks, which fully decouples the parameters for the two conflicting sub-tasks. (b) The architectures of homogeneous and heterogeneous side-fusions for their corresponding re-id side-net. These modules transfer knowledge from the trained detection side-net to the re-id side-net.
Figure 3: Illustration of the re-id head. Person feature maps are drawn from the output of 'conv4' and refined by the 'conv5' block. By consecutive global average pooling and batch normalization, this module produces 1-D person feature vectors.
Figure 4: Illustration of augmented person bounding boxes from the GT boxes. Center shifting and box scaling are employed to add spatial noises.
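
The center shifting and box scaling described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the function names `add_spatial_noise` and `clip_box`, the `(x1, y1, x2, y2)` box format, and the default `max_shift` and `scale_range` values are all assumptions made for the example and do not come from the paper.

```python
import random


def add_spatial_noise(box, max_shift=0.1, scale_range=(0.9, 1.1), rng=random):
    """Return a jittered copy of a ground-truth box given as (x1, y1, x2, y2).

    max_shift and scale_range are illustrative defaults (assumptions),
    not values reported in the paper.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2.0, y1 + h / 2.0

    # Center shifting: move the center by a random fraction of the box size.
    cx += rng.uniform(-max_shift, max_shift) * w
    cy += rng.uniform(-max_shift, max_shift) * h

    # Box scaling: rescale width and height independently.
    w *= rng.uniform(*scale_range)
    h *= rng.uniform(*scale_range)

    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


def clip_box(box, img_w, img_h):
    """Clamp a box to the image bounds so the noisy box stays valid."""
    x1, y1, x2, y2 = box
    return (max(0.0, x1), max(0.0, y1), min(float(img_w), x2), min(float(img_h), y2))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    gt_box = (100.0, 50.0, 200.0, 250.0)
    for _ in range(3):
        print(clip_box(add_spatial_noise(gt_box, rng=rng), img_w=640, img_h=480))
```

In a setup like this, the jittered boxes would stand in for imperfect detector outputs when cropping person features for re-id training, which is one plausible reading of why spatial noise is added to the GT boxes.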
