Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh
TL;DR
This work tackles realtime multi-person 2D pose estimation by introducing Part Affinity Fields (PAFs), a bottom-up representation that encodes the location and orientation of limbs so that detected body parts can be associated with individual people. A two-branch, multi-stage CNN predicts part confidence maps and PAFs, with supervision at every stage, and a greedy parsing step assembles full-body poses from these outputs. The key contributions are the PAF representation, an architecture that jointly learns part detection and association, and a fast greedy parsing algorithm over a tree-structured skeleton whose cost stays low as the number of people grows. The method placed first in the COCO 2016 keypoints challenge and significantly exceeds the previous state of the art on the MPII Multi-Person benchmark while running in realtime, which makes it practical for crowded scenes and interactive systems.
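To make the greedy association idea concrete, here is a minimal sketch for a single limb type. It is not the authors' implementation. It assumes a candidate limb is scored by averaging, over points sampled along the segment between two part candidates, the dot product of the predicted field with the segment's unit direction, and that pairs are then accepted greedily in descending score order. The function names, the nearest-pixel sampling, and the `min_score` threshold are hypothetical choices made for illustration.

```python
import numpy as np


def paf_score(paf, a, b, num_samples=10):
    """Average alignment between a 2D vector field and the segment a -> b.

    paf: array of shape (H, W, 2) holding (x, y) vector components.
    a, b: (x, y) pixel coordinates of two candidate parts.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d = b - a
    norm = np.linalg.norm(d)
    if norm == 0:
        return 0.0
    unit = d / norm
    # Sample evenly spaced points on the segment (nearest-pixel lookup).
    ts = np.linspace(0.0, 1.0, num_samples)
    pts = a[None, :] + ts[:, None] * d[None, :]
    xs = np.clip(np.rint(pts[:, 0]).astype(int), 0, paf.shape[1] - 1)
    ys = np.clip(np.rint(pts[:, 1]).astype(int), 0, paf.shape[0] - 1)
    vecs = paf[ys, xs]  # shape (num_samples, 2)
    return float(np.mean(vecs @ unit))


def greedy_match(cands_a, cands_b, paf, min_score=0.0):
    """Greedily pair candidates of two part types, best score first.

    Returns a list of (index_in_a, index_in_b); each candidate is used
    at most once. min_score is an illustrative cutoff, not a tuned value.
    """
    scored = [
        (paf_score(paf, a, b), i, j)
        for i, a in enumerate(cands_a)
        for j, b in enumerate(cands_b)
    ]
    scored.sort(reverse=True)
    used_a, used_b, pairs = set(), set(), []
    for score, i, j in scored:
        if score <= min_score:
            break  # remaining pairs score no higher
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        pairs.append((i, j))
    return pairs
```

With a field that points rightward along two horizontal limbs, the sketch pairs each left endpoint with the right endpoint on the same row, because the diagonal alternatives pass mostly through zero-valued regions and score much lower. Matching one limb type at a time in this way is what keeps the parsing step cheap as more people appear in the image.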
Abstract
We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving realtime performance, irrespective of the number of people in the image. The architecture is designed to jointly learn part locations and their association via two branches of the same sequential prediction process. Our method placed first in the inaugural COCO 2016 keypoints challenge, and significantly exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark, both in performance and efficiency.
