Joint Coordinate Regression and Association For Multi-Person Pose Estimation, A Pure Neural Network Approach
Dongyang Yu, Yunshi Xie, Wangpeng An, Li Zhang, Yufeng Yao
TL;DR
This paper presents JCRA, a transformer-based, one-stage end-to-end method for multi-person 2D pose estimation that directly regresses keypoint coordinates and learns associations without post-processing. It introduces a symmetric encoder-decoder with deformable attention, a 300-query pose decoder, and a Hungarian-based loss to enable set-based prediction of full-body poses. JCRA achieves competitive or superior performance on COCO and CrowdPose, notably reaching 69.2 AP on COCO val2017 and offering substantial speed gains over prior bottom-up approaches. The method demonstrates practical potential for real-time applications and lays groundwork for integrating keypoint and bounding-box outputs in future work.
Abstract
We introduce Joint Coordinate Regression and Association (JCRA), a novel one-stage, end-to-end algorithm for multi-person 2D pose estimation that produces human pose joints and their associations without any post-processing. The algorithm is fast, accurate, effective, and simple. Its one-stage end-to-end architecture significantly improves inference speed, while a symmetric network structure for the encoder and decoder ensures high accuracy in identifying keypoints. A transformer network directly outputs part positions, which yields a significant improvement in performance. Extensive experiments on the MS COCO and CrowdPose benchmarks demonstrate that JCRA outperforms state-of-the-art approaches in both accuracy and efficiency. In particular, JCRA achieves 69.2 mAP and runs 78% faster at inference than previous state-of-the-art bottom-up algorithms. The code will be made publicly available.
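The summary above mentions set-based prediction trained with a Hungarian-based loss. As a rough illustration of that idea, the sketch below matches a fixed set of predicted poses one-to-one to ground-truth poses by solving a minimum-cost assignment. The function name `match_poses`, the array shapes, and the cost (mean L1 keypoint error minus the predicted person score) are assumptions made for this example and are not taken from the paper; only `scipy.optimize.linear_sum_assignment` is an existing library call.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_poses(pred_kpts, pred_scores, gt_kpts):
    """Match predicted poses to ground-truth poses one-to-one (hypothetical helper).

    pred_kpts:   (Q, K, 2) predicted keypoint coordinates, one pose per query
    pred_scores: (Q,) predicted probability that a query contains a person
    gt_kpts:     (G, K, 2) ground-truth keypoint coordinates, with G <= Q
    Returns the indices of matched queries and the ground-truth index for each.
    """
    # Pairwise mean L1 distance between every prediction and every target: (Q, G)
    l1 = np.abs(pred_kpts[:, None] - gt_kpts[None]).mean(axis=(2, 3))
    # Assumed matching cost: low coordinate error and high confidence are preferred.
    cost = l1 - pred_scores[:, None]
    # Minimum-cost bipartite assignment; the matrix may be rectangular.
    query_idx, gt_idx = linear_sum_assignment(cost)
    return query_idx, gt_idx


if __name__ == "__main__":
    # Toy data: 2 ground-truth people, 4 queries, 3 keypoints each.
    gt = np.array([
        [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]],
        [[0.7, 0.7], [0.8, 0.8], [0.9, 0.9]],
    ])
    pred = np.stack([
        gt[1] + 0.01,           # close to person 1
        np.full((3, 2), 0.5),   # background-like query
        gt[0] - 0.01,           # close to person 0
        np.zeros((3, 2)),       # background-like query
    ])
    scores = np.array([0.9, 0.2, 0.8, 0.1])
    query_idx, gt_idx = match_poses(pred, scores, gt)
    for q, g in zip(query_idx, gt_idx):
        print(f"query {q} -> ground-truth person {g}")
```

In a setup of this kind, the matched pairs would typically receive a coordinate regression loss, while unmatched queries are trained toward a "no person" label; this is why no post-processing such as keypoint grouping or non-maximum suppression is needed at inference time.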
