BPJDet: Extended Object Representation for Generic Body-Part Joint Detection

Huayi Zhou, Fei Jiang, Jiaxin Si, Yue Ding, Hongtao Lu

TL;DR

BPJDet tackles the challenging problem of joint detection and association of human bodies and their parts by introducing an extended object representation that appends center-offsets to body parts. It supports both anchor-based and anchor-free detectors, uses a multi-task loss to train detection and association end-to-end, and employs an association decoding scheme to link parts to bodies without post-matching. Across CityPersons, CrowdHuman, BodyHands, COCOHumanParts, and Animals5C, BPJDet delivers state-of-the-art body–part association while preserving detection accuracy, and its applications to accurate crowd head detection and hand contact estimation demonstrate practical impact. The approach generalizes to animals and is released open-source, offering a versatile baseline for future body–part joint detection work.

Abstract

Detection of the human body and its parts has been intensively studied. However, most CNN-based detectors are trained independently, making it difficult to associate detected parts with the body they belong to. In this paper, we focus on the joint detection of the human body and its parts. Specifically, we propose a novel extended object representation that integrates the center-offsets of body parts, and construct an end-to-end generic Body-Part Joint Detector (BPJDet). In this way, body-part associations are neatly embedded in a unified representation containing both semantic and geometric content. We can therefore optimize a multi-loss objective to tackle multiple tasks synergistically. Moreover, this representation is suitable for both anchor-based and anchor-free detectors. BPJDet does not suffer from error-prone post-matching and keeps a better trade-off between speed and accuracy. Furthermore, BPJDet can be generalized to detect a single body part or multiple body parts of either humans or quadruped animals. To verify the superiority of BPJDet, we conduct experiments on datasets with a single body part (CityPersons, CrowdHuman and BodyHands) and with multiple body parts (COCOHumanParts and Animals5C). While keeping high detection accuracy, BPJDet achieves state-of-the-art association performance on all datasets. In addition, we show the benefits of this body-part association capability by improving the performance of two representative downstream applications: accurate crowd head detection and hand contact estimation. The project is available at https://hnuzhy.github.io/projects/BPJDet.
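
To make the idea of an extended representation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class names (`BodyPrediction`, `PartPrediction`), the `decode_associations` helper, and the matching rule (pick the detected part whose center lies inside the body box and is nearest to the body's predicted part center) are all assumptions made for illustration. The abstract only states that center-offsets of body parts are integrated into the object representation.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class BodyPrediction:
    """Hypothetical extended body object: box, score, and a center-offset to its part."""
    box: Box
    score: float
    part_offset: Tuple[float, float]  # (dx, dy) from body center to predicted part center


@dataclass
class PartPrediction:
    """Hypothetical part object (e.g., a face or hand box)."""
    box: Box
    score: float


def center(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def inside(point: Tuple[float, float], box: Box) -> bool:
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def decode_associations(
    bodies: List[BodyPrediction], parts: List[PartPrediction]
) -> List[Optional[int]]:
    """For each body, return the index of its matched part, or None.

    Assumed rule: among parts whose center falls inside the body box,
    choose the one closest to the part center implied by the body's offset.
    """
    matches: List[Optional[int]] = []
    for body in bodies:
        bx, by = center(body.box)
        target = (bx + body.part_offset[0], by + body.part_offset[1])
        best_idx, best_dist = None, float("inf")
        for i, part in enumerate(parts):
            pc = center(part.box)
            if not inside(pc, body.box):
                continue
            dist = hypot(pc[0] - target[0], pc[1] - target[1])
            if dist < best_dist:
                best_idx, best_dist = i, dist
        matches.append(best_idx)
    return matches


if __name__ == "__main__":
    bodies = [
        BodyPrediction((0, 0, 100, 200), 0.9, (0, -80)),
        BodyPrediction((80, 0, 180, 200), 0.8, (0, -80)),
    ]
    faces = [
        PartPrediction((120, 10, 140, 30), 0.7),
        PartPrediction((40, 10, 60, 30), 0.6),
    ]
    print(decode_associations(bodies, faces))
```

Because the offset travels with the body prediction, the association needs no separately trained matching module; in this sketch it reduces to a nearest-center lookup over the detected parts.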

Paper Structure (34 sections, 14 equations, 12 figures, 9 tables, 1 algorithm)

This paper contains 34 sections, 14 equations, 12 figures, 9 tables, 1 algorithm.

Figures (12)

  • Figure 1: Illustration of the difference between our proposed single-stage BPJDet and other two-stage body-part joint detection methods (e.g., JointDet chi2020relational, BFJDet wan2021body, BodyHands narasimhaswamy2022whose and Hier R-CNN yang2020hier). Here, "two-stage" means that the detection and association modules are trained separately, unlike our one-stage framework, which performs detection and association jointly. Bodies and parts that belong to the same person are drawn with bounding boxes of the same color.
  • Figure 2: Illustrations of three popular human keypoint regression strategies: (1) center-offset regression, used for direct structured pose representation, (2) part affinity fields, used for learning 2D unit direction vectors, (3) hierarchical offset regression, used for hierarchical structured pose representation. We adapt them to body-part association.
  • Figure 3: Our BPJDet adopts YOLOv5 as the backbone $\mathcal{N}$ to extract features and predict grids $\mathit{\widehat{G}}$ from one augmented input image $\mathbf{I}$. During training, target grids $\mathit{G}$ are used to supervise the elaborately designed multi-loss function $\mathcal{L}$. At inference, NMS and the association decoding algorithm are applied sequentially to the predicted objects $\widehat{\mathbf{O}}$ to obtain the final set of human body boxes $\widetilde{\mathbf{O}}^{b'}$ and the set of related body parts $\widetilde{\mathbf{O}}^{p'}$.
  • Figure 4: Examples of grid cell predictions, with human body objects in red and body part objects (e.g., face) in green. The "--" marks entries that are not used when calculating training losses.
  • Figure 5: The influence of loss weight parameter $\lambda$ (x-axis, enlarged 100$\times$) on pairs of (a) MR$^{-2}$s (body or face) and mMR$^{-2}$, (b) AP-body and mMR$^{-2}$, and (c) AP-face and mMR$^{-2}$.
  • ...and 7 more figures
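
The Figure 3 caption says that NMS is applied to the predicted objects before association decoding. As a reminder of what that first step does, below is a minimal sketch of standard greedy IoU-based non-maximum suppression in plain Python. The function names and the `iou_thresh=0.5` default are illustrative assumptions; the thresholds and implementation actually used by BPJDet are not given in this section.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_thresh: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap any already-kept box by more than iou_thresh.
    Returns the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # the second box overlaps the first and is suppressed
```

In a joint detector, a step like this would typically be run per class (bodies and each part type separately), so that a face box is never suppressed by the body box that contains it.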