Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation

Jie Yang; Ailing Zeng; Shilong Liu; Feng Li; Ruimao Zhang; Lei Zhang

TL;DR

ED-Pose introduces two explicit box-detection stages, one for humans and one for keypoints, into a unified, end-to-end pose estimation framework. By using a Human Detection Decoder to initialize global context and a Human-to-Keypoint Detection Decoder to fuse local context via 4D keypoint boxes, the method avoids post-processing and dense heatmaps. The approach employs a set-based Hungarian loss with L1/OKS-style keypoint regression and uses interactive learning to blend global and local information. Empirically, ED-Pose achieves state-of-the-art results on CrowdPose and competitive performance on COCO, while being more efficient than prior DETR-based and end-to-end methods.
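To make the set-based matching idea concrete, here is a minimal sketch of one-to-one assignment between predicted poses and ground-truth poses under an L1 cost. It is not the authors' implementation: the function names `l1_cost` and `match_predictions` are hypothetical, the cost uses L1 only (no classification or OKS term), and a brute-force search over permutations stands in for the Hungarian algorithm, which finds the same optimal assignment far more efficiently.

```python
from itertools import permutations

def l1_cost(pred, gt):
    """Mean absolute difference between two flat keypoint lists [x1, y1, x2, y2, ...]."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def match_predictions(preds, gts):
    """Return (total_cost, pairs) for the cheapest one-to-one assignment.

    pairs holds (prediction_index, ground_truth_index) tuples.
    Brute force over permutations: a stand-in for Hungarian matching,
    usable only for a handful of people.
    """
    assert len(preds) >= len(gts), "need at least one query per ground-truth person"
    best_cost, best_pairs = float("inf"), []
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(l1_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_cost, best_pairs

# Toy data (made up): three queries, two people, two keypoints each.
preds = [
    [0.80, 0.80, 0.90, 0.90],  # query 0, close to person 1
    [0.10, 0.10, 0.20, 0.20],  # query 1, close to person 0
    [0.50, 0.50, 0.50, 0.50],  # query 2, should stay unmatched
]
gts = [
    [0.10, 0.10, 0.20, 0.25],  # person 0
    [0.80, 0.85, 0.90, 0.90],  # person 1
]
cost, pairs = match_predictions(preds, gts)
print(pairs)  # [(1, 0), (0, 1)]
```

Only the matched pairs would contribute to the regression loss; unmatched queries such as query 2 would be supervised as background in a full set-prediction setup.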

Abstract

This paper presents a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, called ED-Pose, which unifies the contextual learning between human-level (global) and keypoint-level (local) information. Unlike previous one-stage methods, ED-Pose re-considers this task as two explicit box detection processes with a unified representation and regression supervision. First, we introduce a human detection decoder from encoded tokens to extract global features. It provides a good initialization for the subsequent keypoint detection, making the training process converge quickly. Second, to bring in contextual information near keypoints, we regard pose estimation as a keypoint box detection problem to learn both box positions and contents for each keypoint. A human-to-keypoint detection decoder adopts an interactive learning strategy between human and keypoint features to further enhance global and local feature aggregation. In general, ED-Pose is conceptually simple, requiring neither post-processing nor dense heatmap supervision. It demonstrates its effectiveness and efficiency compared with both two-stage and one-stage methods. Notably, explicit box detection boosts the pose estimation performance by 4.5 AP on COCO and 9.9 AP on CrowdPose. For the first time, as a fully end-to-end framework with an L1 regression loss, ED-Pose surpasses heatmap-based top-down methods under the same backbone by 1.2 AP on COCO and achieves the state of the art with 76.6 AP on CrowdPose without bells and whistles. Code is available at https://github.com/IDEA-Research/ED-Pose.
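The AP numbers quoted above are keypoint APs, which on COCO are computed by thresholding Object Keypoint Similarity (OKS) rather than box IoU. As a reading aid, the sketch below computes OKS for a single pose pair, following the COCO keypoint evaluation definition as commonly stated: each labeled keypoint contributes exp(-d^2 / (2 s^2 k^2)), with s^2 the object area and k = 2 * sigma. The function name `oks` and the sigma values in the example are illustrative assumptions, not values taken from the paper.

```python
import math

def oks(pred, gt, visible, area, sigmas):
    """Object Keypoint Similarity between one predicted and one ground-truth pose.

    pred, gt : lists of (x, y) positions in pixels
    visible  : list of bools, True where the ground-truth keypoint is labeled
    area     : ground-truth object area in pixels^2 (the scale term s^2)
    sigmas   : per-keypoint falloff constants
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if not v:
            continue  # unlabeled keypoints do not affect the score
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma
        total += math.exp(-d2 / (2.0 * area * k * k))
        count += 1
    return total / count if count else 0.0

# Toy example with made-up numbers: two keypoints on a 100 x 100 pixel person.
gt = [(10.0, 10.0), (20.0, 20.0)]
sigmas = [0.05, 0.05]  # illustrative only
area = 100.0 * 100.0

perfect = oks(gt, gt, [True, True], area, sigmas)
shifted = oks([(13.0, 14.0), (20.0, 20.0)], gt, [True, True], area, sigmas)
print(perfect, shifted)
```

A prediction counts as a true positive at a given threshold (for example 0.5 or 0.75) when its OKS with the matched ground truth exceeds that threshold; AP averages precision over a range of such thresholds.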

Paper Structure (19 sections, 3 equations, 7 figures, 10 tables)

Introduction
Related work
Rethinking One-stage Multi-Person pose estimation
Methodology
Overview
Human Detection Decoder
Human-to-Keypoint Detection Decoder
Experiments
Results on CrowdPose
Results on COCO
Comparison of Effectiveness
Comparison of Efficiency
Ablation Study
Conclusion
Experiment Setup
...and 4 more sections

Figures (7)

Figure 1: Illustration of (a) the perception of the pose estimation task that usually captures global and local contexts concurrently; (b) a taxonomy of existing estimators. ED-Pose (Ours) is a novel one-stage method of learning both global and local relations in an end-to-end manner.
Figure 2: The overview architecture of our ED-Pose, which contains a Human Detection Decoder and a Human-to-Keypoint Detection Decoder to detect human and keypoint boxes explicitly.
Figure 3: The detailed illustration of (a) Human Detection Decoder, (b) Human-to-Keypoint Detection Decoder and (c) the detailed Interactive Learning in Human-to-Keypoint Detection Decoder.
Figure 4: Qualitative results of ED-Pose on COCO (the first row) and CrowdPose (the second row). We present both explicitly detected person boxes and keypoint boxes to understand how they work.
Figure 5: Comparison of training convergence speed (left) and of the trade-off between inference time and performance (right) across mainstream methods. Our one-stage method ED-Pose is more efficient than the Bottom-Up (BU) model HigherHRNet (Cheng et al., 2020), the Top-Down (TD) models Sim.Base. (Xiao et al., 2018) and DETR-based Poseur (Mao et al., 2022), and the one-stage method PETR (Shi et al., 2022).
...and 2 more figures
