SAM 3D Body: Robust Full-Body Human Mesh Recovery
Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, Kris Kitani
TL;DR
SAM 3D Body (3DB) tackles single-image full-body 3D human mesh recovery (HMR). It adopts the Momentum Human Rig (MHR), a parametric mesh representation that decouples skeletal structure from surface shape, and it supports promptable inference through 2D keypoints and masks. The model uses a shared image encoder with separate body and hand decoders and a prompt token system to unify body and hand estimation, while a large-scale data engine and multi-stage annotation pipeline provide diverse, high-quality supervision. Evaluations show state-of-the-art results on standard benchmarks, strong hand pose accuracy, and robust performance on new, out-of-domain datasets, supported by a large human preference study. The work highlights the practical value of interactive prompting and diverse supervision for robust, real-world full-body HMR, and it releases both 3DB and MHR as open source.
Abstract
We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal structure from surface shape. 3DB employs an encoder-decoder architecture and supports auxiliary prompts, including 2D keypoints and masks, enabling user-guided inference similar to the SAM family of models. We derive high-quality annotations from a multi-stage annotation pipeline that uses various combinations of manual keypoint annotation, differentiable optimization, multi-view geometry, and dense keypoint detection. Our data engine efficiently selects and processes data to ensure diversity, collecting unusual poses and rare imaging conditions. We present a new evaluation dataset organized by pose and appearance categories, enabling nuanced analysis of model behavior. Our experiments demonstrate superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Both 3DB and MHR are open source.
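
To make the idea of decoupling skeletal structure from surface shape concrete, the following is a minimal toy sketch in Python. It is not the MHR parameterization or its API: the names `ToyRig`, `joint_positions`, and `surface_widths` are hypothetical, and the planar chain with per-limb radii is an assumption chosen only for illustration. It shows that when skeleton parameters (bone lengths) and shape parameters (limb radii) are stored separately, changing the surface leaves the joint locations untouched.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyRig:
    """Hypothetical toy rig: skeleton and surface are separate parameter sets."""
    bone_lengths: List[float]  # skeleton parameters (one per bone)
    limb_radii: List[float]    # surface shape parameters (one per bone)


def joint_positions(rig: ToyRig, joint_angles: List[float]) -> List[Tuple[float, float]]:
    """Planar forward kinematics; uses only the skeleton and the pose angles."""
    x, y, theta = 0.0, 0.0, 0.0
    positions = [(x, y)]
    for length, angle in zip(rig.bone_lengths, joint_angles):
        theta += angle  # angles are relative to the parent bone
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        positions.append((x, y))
    return positions


def surface_widths(rig: ToyRig) -> List[float]:
    """Limb widths; uses only the shape parameters."""
    return [2.0 * r for r in rig.limb_radii]


# Same skeleton and pose, two different surface shapes.
pose = [0.0, math.pi / 2]
thin = ToyRig(bone_lengths=[1.0, 1.0], limb_radii=[0.05, 0.04])
thick = ToyRig(bone_lengths=[1.0, 1.0], limb_radii=[0.10, 0.08])
same_joints = joint_positions(thin, pose) == joint_positions(thick, pose)
```

In this toy setting `same_joints` is true while `surface_widths` differs between the two rigs, which is the property a decoupled representation is meant to provide: pose and skeleton can be fitted without being entangled with body surface shape.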
