Joint Coordinate Regression and Association For Multi-Person Pose Estimation, A Pure Neural Network Approach

Dongyang Yu, Yunshi Xie, Wangpeng An, Li Zhang, Yufeng Yao

TL;DR

This paper presents JCRA, a transformer-based, one-stage end-to-end method for multi-person 2D pose estimation that directly regresses keypoint coordinates and learns associations without post-processing. It introduces a symmetric encoder-decoder with deformable attention, a 300-query pose decoder, and a Hungarian-based loss to enable set-based prediction of full-body poses. JCRA achieves competitive or superior performance on COCO and CrowdPose, notably reaching 69.2 AP on COCO val2017 and offering substantial speed gains over prior bottom-up approaches. The method demonstrates practical potential for real-time applications and lays groundwork for integrating keypoint and bounding-box outputs in future work.
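The set-based prediction mentioned above relies on a one-to-one matching between predicted poses and ground-truth poses before any loss is computed. The following is a minimal sketch of that matching step, not the authors' implementation. The cost function (mean L1 distance over keypoints), the helper names `pose_cost` and `match_poses`, and the toy coordinates are all assumptions made for illustration. The search is a brute-force enumeration that returns the same optimum the Hungarian algorithm would, and it is only practical for tiny inputs. With 300 queries a real solver such as SciPy's `linear_sum_assignment` would be used instead.

```python
from itertools import permutations


def pose_cost(pred, gt):
    """Assumed matching cost: mean L1 distance between two poses.

    Each pose is a list of (x, y) keypoints in normalized coordinates.
    The actual JCRA cost terms are not specified in this summary.
    """
    total = sum(abs(px - gx) + abs(py - gy)
                for (px, py), (gx, gy) in zip(pred, gt))
    return total / len(gt)


def match_poses(preds, gts):
    """Find the minimum-cost one-to-one assignment of predictions to targets.

    Brute-force stand-in for the Hungarian algorithm; requires
    len(preds) >= len(gts). Returns (total_cost, [(pred_idx, gt_idx), ...]).
    """
    best_cost, best_pairs = float("inf"), []
    # Try every ordered choice of len(gts) distinct predictions.
    for pred_ids in permutations(range(len(preds)), len(gts)):
        cost = sum(pose_cost(preds[p], gts[g]) for g, p in enumerate(pred_ids))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(pred_ids)]
    return best_cost, best_pairs


# Toy data: three hypothetical query outputs, two ground-truth people,
# two keypoints per pose.
preds = [
    [(0.9, 0.9), (0.8, 0.7)],
    [(0.1, 0.2), (0.2, 0.4)],
    [(0.5, 0.5), (0.5, 0.5)],
]
gts = [
    [(0.1, 0.1), (0.2, 0.3)],
    [(0.9, 0.8), (0.8, 0.8)],
]

cost, pairs = match_poses(preds, gts)
# Queries left without a target; in DETR-style set prediction these are
# typically supervised as "no person" (an assumption here).
unmatched = sorted(set(range(len(preds))) - {p for p, _ in pairs})
print(pairs, unmatched)
```

Because every ground-truth pose is matched to exactly one query, duplicate detections are penalized during training, which is what lets a set-prediction model skip grouping and non-maximum-suppression style post-processing at inference time.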

Abstract

We introduce a novel one-stage end-to-end multi-person 2D pose estimation algorithm, Joint Coordinate Regression and Association (JCRA), that produces human pose joints and their associations without any post-processing. The proposed algorithm is fast, accurate, effective, and simple. The one-stage end-to-end network architecture significantly improves the inference speed of JCRA. We also devise a symmetric network structure for the encoder and decoder, which ensures high accuracy in identifying keypoints. The architecture directly outputs part positions via a transformer network, resulting in a significant improvement in performance. Extensive experiments on the MS COCO and CrowdPose benchmarks demonstrate that JCRA outperforms state-of-the-art approaches in both accuracy and efficiency. In particular, JCRA achieves 69.2 mAP and is 78% faster at inference than previous state-of-the-art bottom-up algorithms. The code for this algorithm will be publicly available.

Paper Structure (20 sections, 2 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of (a) top-down, (b) bottom-up, (c) two-stage end-to-end, and (d) one-stage end-to-end methods. The last row shows an overview of our Joint Coordinate Regression and Association (JCRA) algorithm. JCRA is a one-stage end-to-end method.
  • Figure 2: Overview of the Joint Coordinate Regression and Association (JCRA) algorithm.
  • Figure 3: Visualization results of JCRA. The first and second rows show results on the COCO dataset. JCRA handles a wide range of poses, including viewpoint changes, occlusion, and crowded scenes.
  • Figure 4: L denotes the numbers of encoder and decoder layers. L = 4,3 means that the keypoint encoder has 4 layers and the pose decoder has 3 layers. With L = 6,5, JCRA achieves its highest score, 69.2 mAP on the COCO val2017 dataset.
  • Figure 5: A comparative analysis of speed and accuracy across various methods.