Robust Pedestrian Detection via Constructing Versatile Pedestrian Knowledge Bank

Sungjune Park, Hyunjun Kim, Yong Man Ro

TL;DR

This work tackles the generalization gap in pedestrian detection by building a versatile pedestrian knowledge bank derived from a large-scale pretrained model (CLIP). The bank is formed by quantizing generalized pedestrian embeddings and guiding them with a learnable hint to become task-compatible, then integrated into both region-proposal and query-based detectors via cross-attention. Empirical results on four public datasets demonstrate state-of-the-art performance and strong cross-framework transfer, with analyses confirming semantic coherence of bank contents and robustness across driving and surveillance scenes. The approach offers a practical pathway to transplant broad visual knowledge into domain-specific detection, enabling robust performance without requiring end-to-end retraining of the pretrained model.

Abstract

Pedestrian detection is a crucial field of computer vision research with various real-world applications (e.g., self-driving systems). However, despite notable progress in pedestrian detection, the pedestrian representations learned within a detection framework are usually limited to the particular scene data on which they were trained. Therefore, in this paper, we propose a novel approach to construct a versatile pedestrian knowledge bank containing representative pedestrian knowledge that is applicable to various detection frameworks and can be adopted in diverse scenes. We extract generalized pedestrian knowledge from a large-scale pretrained model, and we curate it by quantizing the most representative features and guiding them to be distinguishable from background scenes. Finally, we construct the versatile pedestrian knowledge bank from these representations and leverage it to complement and enhance pedestrian features within a pedestrian detection framework. Through comprehensive experiments, we validate the effectiveness of our method, demonstrating its versatility and outperforming state-of-the-art detection methods.

Paper Structure (28 sections, 7 equations, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: The overall concept of our approach. We extract generalized pedestrian knowledge from a large-scale pretrained model and curate it to be exemplary and task-compatible. The knowledge bank stores this knowledge, which can be leveraged in various frameworks for robust pedestrian detection on diverse scene data.
  • Figure 2: The overall steps of the proposed approach. In the first step, we extract the knowledge embeddings of various instances from a large-scale pretrained image encoder. We quantize them into the most representative features $\boldsymbol{f_q}$ and make these task-relevant by placing the hint $\boldsymbol{f_h}$, which yields the task-compatible knowledge features $\boldsymbol{f_k}$. In the second step, we leverage $\boldsymbol{f_k}$ within a pedestrian detection framework.
  • Figure 3: Overview of leveraging the task-compatible pedestrian knowledge $\boldsymbol{f_k}$. When pedestrian features $\boldsymbol{f_p}$ come in as query features, $\boldsymbol{f_k}$ serves as the key and value features. Thus $\boldsymbol{f_p}$ can refer to $\boldsymbol{f_k}$, the distinguishable features from the bank, and the complemented pedestrian features $\boldsymbol{f_c}$ are obtained.
  • Figure 4: Visualization analysis of the semantics in the knowledge bank. We analyze which types of pedestrians are quantized together, and we visualize the distribution of knowledge features using t-SNE. Orange, green, and red $\times$ marks denote the 9th, 28th, and 43rd knowledge elements, respectively, while blue circles denote the others.
  • Figure 5: Visualization of detection results on diverse scenes. The yellow and red boxes denote ground-truth and predicted bounding boxes, respectively. The proposed method performs well in general indoor/outdoor, surveillance, and driving environments. The images are zoomed in for better visualization.
  • ...and 1 more figures
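
The cross-attention step described in the Figure 3 caption can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `complement_with_bank`, the projection matrices `w_q`, `w_k`, `w_v`, the single-head scaled dot-product form, the residual connection, and all tensor sizes are hypothetical choices that the page does not specify. What the caption does state is the role assignment: $\boldsymbol{f_p}$ acts as the query, $\boldsymbol{f_k}$ acts as key and value, and the output is the complemented feature $\boldsymbol{f_c}$.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def complement_with_bank(f_p, f_k, w_q, w_k, w_v):
    """Single-head cross-attention: f_p queries the knowledge bank f_k.

    f_p: (N, D) pedestrian features from the detector (queries).
    f_k: (K, D) knowledge-bank features (keys and values).
    Returns f_c of shape (N, D) and attention weights of shape (N, K).
    """
    q = f_p @ w_q  # (N, d) projected queries
    k = f_k @ w_k  # (K, d) projected keys
    v = f_k @ w_v  # (K, D) projected values
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, K)
    # Assumption: the bank read-out is added back residually to f_p.
    f_c = f_p + attn @ v
    return f_c, attn


# Hypothetical sizes: 5 pedestrian features, a 64-element bank, 32-dim features.
rng = np.random.default_rng(0)
N, K, D, d = 5, 64, 32, 16
f_p = rng.normal(size=(N, D))
f_k = rng.normal(size=(K, D))
w_q = rng.normal(size=(D, d)) / np.sqrt(D)
w_k = rng.normal(size=(D, d)) / np.sqrt(D)
w_v = rng.normal(size=(D, D)) / np.sqrt(D)

f_c, attn = complement_with_bank(f_p, f_k, w_q, w_k, w_v)
print(f_c.shape, attn.shape)
```

Each row of `attn` is a distribution over bank elements, so inspecting it shows which knowledge elements a given pedestrian feature draws on. Because the bank enters only as keys and values, the same read-out could in principle be attached to RoI features in a region-proposal detector or to object queries in a query-based one.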