Scene-Adaptive Person Search via Bilateral Modulations

Yimin Jiang; Huibing Wang; Jinjia Peng; Xianping Fu; Yang Wang

TL;DR

This work tackles scene variability in person search, where background and foreground noise within detected bounding boxes degrades identity features. It introduces SEAS, a scene-adaptive framework built on bilateral modulations: a Background Modulation Network (BMN) that suppresses background noise and a Foreground Modulation Network (FMN) that compensates for foreground noise, yielding stable person representations across scenes. Key innovations include a Multi-Granularity Embedding in BMN with a Background Noise Reduction loss, a noise extractor and cross-attention denoiser in FMN, and a Bidirectional Online Instance Matching loss that combines OIM with a triplet loss. Experiments on CUHK-SYSU and PRW demonstrate state-of-the-art performance and robustness to cross-scene and cross-camera variations.
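As a rough illustration of what combining an OIM-style term with a triplet term can look like, here is a minimal sketch in plain Python. It is not the paper's Bidirectional Online Instance Matching loss: the summary above does not specify the bidirectional part, and the unlabeled-identity queue of the original OIM formulation is omitted. The function names, `temperature`, `margin`, and `weight` are all assumptions chosen for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def oim_loss(feature, lookup_table, target, temperature=0.1):
    """Negative log-likelihood of the target identity under a softmax
    over cosine similarities to a table of identity prototypes.
    The temperature value is an assumption, not taken from the paper."""
    logits = [cosine(feature, proto) / temperature for proto in lookup_table]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on cosine distances: the positive should be closer to the
    anchor than the negative by at least `margin` (assumed value)."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)


def combined_loss(feature, lookup_table, target, positive, negative, weight=1.0):
    """Hypothetical weighted sum of the two terms."""
    return (oim_loss(feature, lookup_table, target)
            + weight * triplet_loss(feature, positive, negative))
```

A feature aligned with its own prototype gives an OIM term near zero, and the triplet term vanishes once the negative is farther from the anchor than the positive by the margin.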

Abstract

Person search aims to localize a specific target person from a gallery set of images with various scenes. As the scene around a moving pedestrian changes, the captured person image inevitably introduces substantial background and foreground noise into the person feature. This noise is completely unrelated to the person's identity and leads to severe performance degradation. To address this issue, we present a Scene-Adaptive Person Search (SEAS) model that introduces bilateral modulations to simultaneously eliminate scene noise and maintain a consistent person representation that adapts to various scenes. In SEAS, a Background Modulation Network (BMN) is designed to encode the feature extracted from the detected bounding box into a multi-granularity embedding, which reduces the input of background noise at multiple levels in a norm-aware manner. Additionally, to mitigate the effect of foreground noise on the person feature, SEAS introduces a Foreground Modulation Network (FMN) that computes a clutter reduction offset for the person embedding based on the feature map of the scene image. By applying bilateral modulations to both background and foreground in an end-to-end manner, SEAS obtains consistent feature representations without scene noise. SEAS achieves state-of-the-art (SOTA) performance on two benchmark datasets, CUHK-SYSU with 97.1% mAP and PRW with 60.5% mAP. The code is available at https://github.com/whbdmu/SEAS.
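To make the idea of an offset computed from the scene feature map more concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is an illustration under stated assumptions, not the FMN described in the paper: there are no learned projections, the scene feature map is flattened into a list of token vectors, and subtracting the attended vector from the person embedding is an assumed way of applying the offset. The names `attention_offset` and `modulate` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_offset(person, scene_tokens):
    """Use the person embedding as the query and the scene feature-map
    locations as both keys and values; return the attended vector."""
    scale = math.sqrt(len(person))
    scores = [sum(q * k for q, k in zip(person, token)) / scale
              for token in scene_tokens]
    weights = softmax(scores)
    return [sum(w * token[i] for w, token in zip(weights, scene_tokens))
            for i in range(len(person))]


def modulate(person, scene_tokens):
    """Assumed update rule: subtract the scene-derived offset from the
    person embedding. The sign and form of the update are assumptions."""
    offset = attention_offset(person, scene_tokens)
    return [p - o for p, o in zip(person, offset)]
```

In a trained model the query, key, and value would pass through learned projections, and the offset would be shaped by the training loss rather than being a raw weighted average of scene features.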

Paper Structure (12 sections, 6 equations, 8 figures, 7 tables)

Introduction
Method
Framework overview
Background Modulation Network
Foreground Modulation Network
Bidirectional Online Instance Matching Loss
Experiments
Datasets and Evaluation Metric
Implementation
Comparison to the State-of-the-Arts
Ablation Study
Conclusion

Figures (8)

Figure 1: Composition of the person feature. The person feature consists of scene noise and pure person feature, while the scene noise can be divided into background noise, which comes from the residual background in the detected bounding box, and foreground noise, which is caused by the influence of lighting conditions, visibility, and so on.
Figure 2: Comparison of three person search strategies. (a) Searching with the projected person feature is the original approach, but its cross-scene ability is unsatisfactory because it neglects scene noise. (b) The binding strategy binds scene features to person features. It retrieves persons well when each person stays in a fixed scene, but retrieval deteriorates when the scene changes. (c) Our strategy leverages scene features to eliminate scene noise from person features, achieving adaptation to diverse scenes.
Figure 3: Architecture of the SEAS framework. The legend in the lower left corner explains the color coding of the components. The figure is laid out in two rows and reads clockwise, starting at the top left and ending at the bottom left. Solid rounded boxes enclose component schematics; dashed rounded boxes enclose the aim of each network.
Figure 4: Details of our multi-granularity embedding.
Figure 5: Details of our foreground modulation network.
...and 3 more figures
