Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh

TL;DR

This work tackles realtime, multi-person 2D pose estimation by introducing Part Affinity Fields (PAFs), a bottom-up representation that encodes the position and orientation of limbs and is used to associate detected body parts with individual people. A two-branch, multi-stage CNN predicts confidence maps for body parts in one branch and PAFs in the other, with supervision at every stage, and a greedy parsing step then assembles the part candidates into full poses. The key contributions are the PAF representation, an architecture that jointly learns part detection and part association, and a fast greedy parsing algorithm whose runtime stays realtime regardless of the number of people in the image. The method placed first in the inaugural COCO 2016 keypoints challenge and significantly exceeds the previous state of the art on the MPII Multi-Person benchmark in both accuracy and efficiency, which makes it practical for crowd scenes and interactive systems.

Abstract

We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving realtime performance, irrespective of the number of people in the image. The architecture is designed to jointly learn part locations and their association via two branches of the same sequential prediction process. Our method placed first in the inaugural COCO 2016 keypoints challenge, and significantly exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark, both in performance and efficiency.

Paper Structure

This paper contains 11 sections, 13 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Top: Multi-person pose estimation. Body parts belonging to the same person are linked. Bottom left: Part Affinity Fields (PAFs) corresponding to the limb connecting right elbow and right wrist. The color encodes orientation. Bottom right: A zoomed-in view of the predicted PAFs. At each pixel in the field, a 2D vector encodes the position and orientation of the limb.
  • Figure 2: Overall pipeline. Our method takes the entire image as the input for a two-branch CNN to jointly predict confidence maps for body part detection, shown in (b), and part affinity fields for part association, shown in (c). The parsing step performs a set of bipartite matchings to associate body part candidates (d). We finally assemble them into full body poses for all people in the image (e).
  • Figure 3: Architecture of the two-branch multi-stage CNN. Each stage in the first branch predicts confidence maps $\mathbf{S}^t$, and each stage in the second branch predicts PAFs $\mathbf{L}^t$. After each stage, the predictions from the two branches, along with the image features, are concatenated for the next stage.
  • Figure 4: Confidence maps of the right wrist (first row) and PAFs (second row) of right forearm across stages. Although there is confusion between left and right body parts and limbs in early stages, the estimates are increasingly refined through global inference in later stages, as shown in the highlighted areas.
  • Figure 5: Part association strategies. (a) The body part detection candidates (red and blue dots) for two body part types and all connection candidates (grey lines). (b) The connection results using the midpoint (yellow dots) representation: correct connections (black lines) and incorrect connections (green lines) that also satisfy the incidence constraint. (c) The results using PAFs (yellow arrows). By encoding position and orientation over the support of the limb, PAFs eliminate false associations.
  • ...and 7 more figures
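
The association step described in the captions for Figures 1, 2, and 5 can be illustrated with a small sketch. It scores a candidate limb by averaging the dot product between the PAF vectors sampled along the segment joining two part candidates and the unit vector of that segment, then pairs candidates one-to-one. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `paf_score` and `greedy_match`, the nearest-pixel sampling, the number of samples, and the score threshold are all hypothetical, and a simple greedy one-to-one assignment stands in for the bipartite matching mentioned in Figure 2.

```python
import numpy as np


def paf_score(paf, p1, p2, num_samples=10):
    """Approximate the alignment between a PAF and the segment p1 -> p2.

    paf: array of shape (H, W, 2) holding one (x, y) vector per pixel
         (hypothetical layout chosen for this sketch).
    p1, p2: (x, y) locations of two part candidates, e.g. elbow and wrist.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    d = p2 - p1
    norm = np.linalg.norm(d)
    if norm < 1e-8:
        return 0.0  # coincident candidates carry no orientation
    unit = d / norm

    # Sample evenly spaced points on the segment and read the nearest pixel.
    us = np.linspace(0.0, 1.0, num_samples)
    pts = p1[None, :] + us[:, None] * d[None, :]
    xs = np.clip(np.rint(pts[:, 0]).astype(int), 0, paf.shape[1] - 1)
    ys = np.clip(np.rint(pts[:, 1]).astype(int), 0, paf.shape[0] - 1)
    vecs = paf[ys, xs]  # shape (num_samples, 2)

    # Mean dot product with the limb direction: high when the field is
    # present along the whole segment and points the same way.
    return float(np.mean(vecs @ unit))


def greedy_match(paf, cands_a, cands_b, min_score=0.0):
    """Pair candidates of two part types one-to-one, best score first.

    Assumption of this sketch: a greedy assignment replaces an exact
    bipartite matching solver.
    """
    scored = [
        (paf_score(paf, a, b), i, j)
        for i, a in enumerate(cands_a)
        for j, b in enumerate(cands_b)
    ]
    scored.sort(key=lambda t: t[0], reverse=True)

    used_a, used_b, pairs = set(), set(), []
    for score, i, j in scored:
        if score <= min_score:
            break  # remaining connections are too weak to keep
        if i in used_a or j in used_b:
            continue  # each candidate joins at most one limb
        used_a.add(i)
        used_b.add(j)
        pairs.append((i, j))
    return pairs


# Toy example: two horizontal forearms on rows y=2 and y=7 of a 10x10 field.
paf = np.zeros((10, 10, 2))
paf[2, 2:8] = (1.0, 0.0)
paf[7, 2:8] = (1.0, 0.0)
elbows = [(2, 2), (2, 7)]
wrists = [(7, 7), (7, 2)]  # deliberately listed in the "wrong" order
pairs = greedy_match(paf, elbows, wrists)
print(pairs)
```

In the toy example the diagonal (cross-person) connections pass through pixels where the field is zero, so they score far lower than the two horizontal connections, and each elbow is paired with the wrist on its own row. This mirrors the point of Figure 5(c): because the field is evaluated over the whole support of the limb, connections that merely link plausible endpoints are rejected.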