SAM 3D Body: Robust Full-Body Human Mesh Recovery

Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, Kris Kitani

TL;DR

SAM 3D Body (3DB) tackles single-image full-body 3D human mesh recovery (HMR) using the Momentum Human Rig (MHR), a parametric mesh representation that decouples skeletal structure from surface shape, and supports promptable inference through 2D keypoints and masks. The model uses a shared image encoder with separate body and hand decoders, plus a prompt token system that unifies body and hand estimation, while a large-scale data engine and multi-stage annotation pipeline provide diverse, high-quality supervision. Evaluations show state-of-the-art generalization on standard benchmarks, strong hand pose accuracy, and robust performance on new, out-of-domain datasets, backed by a large human preference study. The work highlights the practical impact of interactive prompting and diverse supervision for robust, real-world full-body HMR, and both 3DB and MHR are released as open source.

Abstract

We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal structure and surface shape. 3DB employs an encoder-decoder architecture and supports auxiliary prompts, including 2D keypoints and masks, enabling user-guided inference similar to the SAM family of models. We derive high-quality annotations from a multi-stage annotation pipeline that uses various combinations of manual keypoint annotation, differentiable optimization, multi-view geometry, and dense keypoint detection. Our data engine efficiently selects and processes data to ensure data diversity, collecting unusual poses and rare imaging conditions. We present a new evaluation dataset organized by pose and appearance categories, enabling nuanced analysis of model behavior. Our experiments demonstrate superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Both 3DB and MHR are open-source.
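To make the idea of keypoint prompting concrete, the following is a minimal sketch of how optional 2D keypoint hints might be turned into prompt tokens for a decoder. It is not the authors' implementation: the `KeypointPrompt` structure, the `build_prompt_tokens` function, the sinusoidal encoding, the token layout, and the all-zero "no prompt" placeholder are all assumptions made for illustration, since this section does not describe the actual token design.

```python
import math
from typing import List, NamedTuple, Sequence


class KeypointPrompt(NamedTuple):
    """A user-supplied 2D keypoint hint (hypothetical structure)."""
    joint_index: int  # which joint the hint refers to
    x: float          # pixel column in the input image
    y: float          # pixel row in the input image


def sinusoidal_encoding(value: float, dim: int) -> List[float]:
    """Encode a scalar in [0, 1] as `dim` interleaved sin/cos features."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    features: List[float] = []
    for k in range(dim // 2):
        freq = (2.0 ** k) * math.pi  # frequencies double at each step
        features.append(math.sin(freq * value))
        features.append(math.cos(freq * value))
    return features


def build_prompt_tokens(
    prompts: Sequence[KeypointPrompt],
    image_width: int,
    image_height: int,
    dim: int = 8,
) -> List[List[float]]:
    """Return one token per prompt, or a single all-zero token if none given.

    Each token is [enc(x), enc(y), joint_index], so its length is 2*dim + 1.
    This layout is an assumption, not the layout used by 3DB.
    """
    token_len = 2 * dim + 1
    if not prompts:
        # Placeholder so the decoder always receives at least one prompt token.
        return [[0.0] * token_len]
    tokens: List[List[float]] = []
    for p in prompts:
        nx = p.x / image_width   # normalize pixel coordinates to [0, 1]
        ny = p.y / image_height
        token = sinusoidal_encoding(nx, dim) + sinusoidal_encoding(ny, dim)
        token.append(float(p.joint_index))
        tokens.append(token)
    return tokens
```

In a full model, tokens like these would presumably be projected to the decoder width and attended to alongside the image features, so that the same network can run with or without user guidance.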

Paper Structure (25 sections, 6 equations, 9 figures, 8 tables)


Figures (9)

  • Figure 1: Full-body human mesh recovery results using SAM 3D Body (3DB). Our model demonstrates robust performance in estimating challenging poses across diverse viewpoints and produces accurate body and hand pose estimations within a unified framework.
  • Figure 2: SAM 3D Body Model Architecture. We employ a promptable encoder–decoder architecture with a shared image encoder and separate decoders for body and hand pose estimation.
  • Figure 3: Left: GUI of our annotation tool for annotating 2D keypoints. Right: Comparison of the dense (thin) and sparse (thick) keypoints for pseudo annotation.
  • Figure 4: Example of single-image MHR mesh fitting for in-the-wild (ITW) datasets. Source: SA-1B [kirillov2023sa1b].
  • Figure 5: Examples of MHR mesh fitting results. (a) Multi-view mesh fitting. Source: EgoExo4D [grauman2024egoexo4d]. (b) Scan-based mesh fitting. Source: Re:InterHand [mueller2023reinterhand].
  • ...and 4 more figures