Deep Intra-Image Contrastive Learning for Weakly Supervised One-Step Person Search

Jiabei Wang, Yanwei Pang, Jiale Cao, Hanqing Sun, Zhuang Shao, Xuelong Li

TL;DR

The paper tackles weakly supervised one-step person search, where only bounding-box annotations are available. It introduces deep intra-image contrastive learning (DICL) within a Siamese framework, featuring spatial-invariant contrast (SIC) and occlusion-invariant contrast (OIC) to exploit instance-level information under spatial and occlusion variance. Through dense and many-to-one intra-image contrasts plus a masking-based occlusion strategy, DICL achieves state-of-the-art results among weakly supervised one-step methods on CUHK-SYSU and PRW, highlighting the value of deeper intra-image mining. This approach offers a simple yet effective baseline that can guide future work in weakly supervised person search and contrastive learning.

Abstract

Weakly supervised person search aims to perform joint pedestrian detection and re-identification (re-id) with only person bounding-box annotations. Contrastive learning has recently been applied to weakly supervised person search, where the two common contrast strategies are memory-based contrast and intra-image contrast. We argue that current intra-image contrast is shallow and suffers from spatial-level and occlusion-level variance. In this paper, we present a novel deep intra-image contrastive learning method using a Siamese network. Its two key modules are spatial-invariant contrast (SIC) and occlusion-invariant contrast (OIC). SIC performs many-to-one contrasts between the two branches of the Siamese network and dense prediction contrasts within one branch. With these many-to-one and dense contrasts, SIC tends to learn discriminative scale-invariant and location-invariant features, addressing spatial-level variance. OIC enhances feature consistency with a masking strategy to learn occlusion-invariant features. Extensive experiments are performed on two person search datasets, CUHK-SYSU and PRW. Our method achieves state-of-the-art performance among weakly supervised one-step person search approaches. We hope that our simple intra-image contrastive learning can provide more paradigms for weakly supervised person search. The source code is available at https://github.com/jiabeiwangTJU/DICL.
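To make the many-to-one contrast concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the loss, so this assumes a generic InfoNCE-style objective with cosine similarity and a temperature. The function names, the `temperature` value, and the toy two-dimensional features are all hypothetical. Several search-branch proposal features that belong to the same person are each contrasted against a single instance-branch feature for that person, with the other persons in the image acting as negatives.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def many_to_one_contrast(proposals, proposal_ids, instances, temperature=0.1):
    """Assumed InfoNCE-style loss, averaged over proposals.

    proposals:    search-branch features, several of which may cover one person
    proposal_ids: index into `instances` of the person each proposal covers
    instances:    one instance-branch feature per person in the image
    """
    losses = []
    for feat, pid in zip(proposals, proposal_ids):
        logits = [cosine(feat, inst) / temperature for inst in instances]
        # Numerically stable log-sum-exp over all persons in the image.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching person as the positive class.
        losses.append(log_denom - logits[pid])
    return sum(losses) / len(losses)


# Toy data: two persons, three proposals (two of them cover person 0).
instances = [[1.0, 0.0], [0.0, 1.0]]
proposals = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

aligned = many_to_one_contrast(proposals, [0, 0, 1], instances)
swapped = many_to_one_contrast(proposals, [1, 1, 0], instances)
print(aligned < swapped)  # correct assignments give the lower loss
```

The loss is small when each proposal is closest to the instance feature of its own person and large when the assignments are swapped, which is the behavior a many-to-one contrast is meant to encourage.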

Paper Structure (17 sections, 4 equations, 6 figures, 8 tables, 1 algorithm)

Figures (6)

  • Figure 1: Intra-image contrast strategies of different weakly supervised one-step methods. CGPS (a) performs dense prediction contrasts using a single network that takes an entire image as input. R-SiamNet (b) performs a one-to-one ground-truth contrast using a Siamese network that takes an entire image and cropped persons as inputs. Compared to these shallow intra-image contrast strategies, our method (c) exploits deep intra-image contrastive learning. We consider many-to-one Siamese contrasts between the two branches of the Siamese network and dense prediction contrasts within one branch. In addition, we perform occlusion-invariant contrast by randomly masking a portion of a person in one branch of the Siamese network.
  • Figure 2: Architecture (a) of our deep intra-image contrastive learning (DICL) based method using a Siamese network, which has a search branch and an instance branch. DICL conducts intra-image contrast using two novel modules: spatial-invariant contrast (SIC) and occlusion-invariant contrast (OIC). SIC (b) performs many-to-one contrasts between the two branches of the Siamese network and dense contrasts over all predictions of an image. OIC (c) enhances feature consistency using the masking strategy.
  • Figure 3: Qualitative results of our method on the PRW test set, where the red box represents the query and the green box represents the search result in the gallery image. Our method finds the query persons under various views and scales.
  • Figure 4: Qualitative results of our method on the CUHK-SYSU test set, where the red box represents the query and the green box represents the search result in the gallery image. Our method matches the query persons in different scenes.
  • Figure 5: Comparison with different weakly supervised one-step methods under different gallery sizes on the CUHK-SYSU test set.
  • ...and 1 more figure
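The masking strategy mentioned in the Figure 1 and Figure 2 captions can be illustrated with a short sketch. It assumes a single rectangular mask at a random position, filled with a constant value; the captions only say that a portion of the person is randomly masked, so the mask shape, the `ratio` and `fill` parameters, and the function name are hypothetical. In an occlusion-invariant setup, features extracted from the masked crop would then be encouraged to stay consistent with features from the unmasked one.

```python
import random


def mask_random_region(crop, ratio=0.3, fill=0, rng=None):
    """Return a copy of `crop` (a list of pixel rows) with one random rectangle,
    whose height and width are each about `ratio` of the crop's, set to `fill`."""
    rng = rng or random.Random()
    height, width = len(crop), len(crop[0])
    mask_h = max(1, int(height * ratio))
    mask_w = max(1, int(width * ratio))
    # Pick the top-left corner so the rectangle stays inside the crop.
    top = rng.randint(0, height - mask_h)
    left = rng.randint(0, width - mask_w)
    masked = [row[:] for row in crop]  # copy rows; the input is left untouched
    for y in range(top, top + mask_h):
        for x in range(left, left + mask_w):
            masked[y][x] = fill
    return masked


# Toy 20x10 single-channel "person crop" filled with ones.
crop = [[1] * 10 for _ in range(20)]
masked = mask_random_region(crop, ratio=0.5, rng=random.Random(0))
masked_pixels = sum(row.count(0) for row in masked)
print(masked_pixels)  # 10 rows x 5 columns are masked
```

Passing an explicit `random.Random` instance keeps the mask position reproducible, which is convenient when comparing masked and unmasked features during debugging.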