OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, Yaser Sheikh

TL;DR

The paper tackles realtime multi-person 2D pose estimation by introducing Part Affinity Fields (PAFs), a bottom-up, nonparametric representation that encodes limb location and orientation so that body parts can be associated with individuals efficiently, even when many people are present. A multi-stage CNN jointly predicts PAFs and body-part confidence maps, and a greedy parsing algorithm assembles poses without heavy global optimization. OpenPose, the open-source system resulting from this work, achieves realtime performance across body, foot, hand, and facial keypoints and demonstrates strong results on MPII and COCO, while enabling broad applicability and multi-view 3D extension. The authors also release a dedicated foot keypoint dataset and show that a combined body and foot detector reduces inference time compared to running the two detectors sequentially while maintaining the accuracy of each.

Abstract

Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that a PAF-only refinement rather than both PAF and body part location refinement results in a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an internal annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.

Paper Structure

This paper contains 20 sections, 14 equations, 19 figures, 9 tables.

Figures (19)

  • Figure 1: Top: Multi-person pose estimation. Body parts belonging to the same person are linked, including foot keypoints (big toes, small toes, and heels). Bottom left: Part Affinity Fields (PAFs) corresponding to the limb connecting right elbow and wrist. The color encodes orientation. Bottom right: A 2D vector in each pixel of every PAF encodes the position and orientation of the limbs.
  • Figure 2: Overall pipeline. (a) Our method takes the entire image as the input for a CNN to jointly predict (b) confidence maps for body part detection and (c) PAFs for part association. (d) The parsing step performs a set of bipartite matchings to associate body part candidates. (e) We finally assemble them into full body poses for all people in the image.
  • Figure 3: Architecture of the multi-stage CNN. The first set of stages predicts PAFs $\mathbf{L}^t$, while the last set predicts confidence maps $\mathbf{S}^t$. The predictions of each stage and their corresponding image features are concatenated for each subsequent stage. Convolutions of kernel size 7 from the original approach [cao2017realtime] are replaced with 3 layers of convolutions of kernel size 3, whose outputs are concatenated at the end.
  • Figure 4: PAFs of right forearm across stages. Although there is confusion between left and right body parts and limbs in early stages, the estimates are increasingly refined through global inference in later stages.
  • Figure 5: Part association strategies. (a) The body part detection candidates (red and blue dots) for two body part types and all connection candidates (grey lines). (b) The connection results using the midpoint (yellow dots) representation: correct connections (black lines) and incorrect connections (green lines) that also satisfy the incidence constraint. (c) The results using PAFs (yellow arrows). By encoding position and orientation over the support of the limb, PAFs eliminate false associations.
  • ...and 14 more figures
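
The captions for Figures 1, 2, and 5 describe how PAFs encode position and orientation along a limb and how the parsing step matches body part candidates. The following is a minimal sketch of that idea, not the OpenPose implementation: it builds a toy PAF for two non-overlapping limbs, scores each elbow-wrist candidate pair by averaging the dot product between the PAF and the candidate direction at points sampled along the segment, and then matches pairs greedily by score. The function names (make_paf, paf_score, greedy_match), the grid size, the sample count, and the score threshold are all assumptions chosen for illustration.

```python
import math

def make_paf(width, height, limbs, limb_width=1.0):
    """Toy PAF: the limb's unit vector on pixels near each limb, zero elsewhere.
    Assumes limbs do not overlap (a later limb would overwrite an earlier one)."""
    paf = [[(0.0, 0.0)] * width for _ in range(height)]
    for (x1, y1), (x2, y2) in limbs:
        length = math.hypot(x2 - x1, y2 - y1)
        ux, uy = (x2 - x1) / length, (y2 - y1) / length
        for y in range(height):
            for x in range(width):
                along = (x - x1) * ux + (y - y1) * uy         # position along the limb
                across = abs((x - x1) * uy - (y - y1) * ux)   # distance from the limb axis
                if 0 <= along <= length and across <= limb_width:
                    paf[y][x] = (ux, uy)
    return paf

def paf_score(paf, a, b, num_samples=10):
    """Average alignment between the PAF and the direction a -> b,
    sampled at evenly spaced points on the segment from a to b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        return 0.0
    ux, uy = dx / norm, dy / norm
    total = 0.0
    for k in range(num_samples):
        t = k / (num_samples - 1)
        x = int(round(a[0] + t * dx))
        y = int(round(a[1] + t * dy))
        vx, vy = paf[y][x]
        total += vx * ux + vy * uy
    return total / num_samples

def greedy_match(parts_a, parts_b, paf, min_score=0.05):
    """Greedy bipartite matching: take the best-scoring pair first and
    skip any pair that reuses an already matched part."""
    candidates = [
        (paf_score(paf, a, b), i, j)
        for i, a in enumerate(parts_a)
        for j, b in enumerate(parts_b)
    ]
    candidates.sort(reverse=True)
    used_a, used_b, matches = set(), set(), []
    for score, i, j in candidates:
        if score < min_score:
            break
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        matches.append((i, j, score))
    return matches

# Two people standing side by side: each elbow sits directly above its wrist.
elbows = [(2, 2), (8, 2)]
wrists = [(2, 8), (8, 8)]
paf = make_paf(12, 12, list(zip(elbows, wrists)))
matches = greedy_match(elbows, wrists, paf)
print(matches)
```

In this toy setup, a cross-person candidate (an elbow of one person paired with the wrist of the other) passes through pixels where the field is zero or points in a different direction, so it scores lower than the true pairing. That is the intuition behind Figure 5(c): orientation information along the whole limb, rather than a single midpoint, separates correct from incorrect connections.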