Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery

Timing Yang, Sicheng He, Hongyi Jing, Jiawei Yang, Zhijian Liu, Chuhang Zou, Yue Wang

Abstract

SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that, unlike methods reliant on wearable IMUs, enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.

Paper Structure (39 sections, 9 equations, 11 figures, 11 tables)

This paper contains 39 sections, 9 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Speed-accuracy overview of Fast SAM 3D Body. Top left: Qualitative results on in-the-wild images show that our framework preserves high-fidelity reconstruction. Top right: Our method achieves up to a $10.25\times$ end-to-end speedup over SAM 3D Body [sam3db] and replaces the iterative MHR-to-SMPL bottleneck [MHR:2025] with a $10{,}000\times$ faster neural mapping. Bottom: Our system enables real-time humanoid robot control from a single RGB stream at ${\sim}65$ ms per frame on an NVIDIA RTX 5090.
  • Figure 2: Overview of the Fast SAM 3D Body pipeline. A pose detector predicts 2D body keypoints from which body and hand coarse bboxes are derived simultaneously, enabling all crops to be encoded in a single batched backbone pass. A lightweight FOV estimator predicts camera intrinsics. The body and hand decoders process the resulting features, and their outputs are merged to produce the final MHR mesh.
  • Figure 3: Inference pipeline comparison. Original 3DB pipeline with serial execution (left) and our accelerated variant with batched execution (right).
  • Figure 4: Qualitative comparison. The original SAM 3D Body (left) and our Fast variant (right) yield visually comparable mesh reconstructions across diverse poses and multi-person scenes on 3DPW [3dpw] and EMDB [emdb].
  • Figure 5: Hand reconstruction. (Left) Quantitative comparison on FreiHAND [2019freihand]: our lightweight hand decoder maintains competitive accuracy with $4.6\times$ faster inference. (Right) Qualitative results.
  • ...and 6 more figures
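
The batching idea described in the Figure 2 caption (deriving body and hand boxes from one set of 2D keypoints so that all crops can go through the backbone in a single pass) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the keypoint layout, the `groups` index sets, the padding ratio, the crop size, and all helper names are assumptions made for illustration, and a nearest-neighbor resample stands in for whatever cropping the real pipeline uses.

```python
import numpy as np

def bbox_from_keypoints(kpts, pad=0.2):
    """Axis-aligned box (x0, y0, x1, y1) around 2D keypoints, enlarged by `pad`.

    `pad` is a hypothetical padding ratio, not a value taken from the paper.
    """
    kpts = np.asarray(kpts, dtype=np.float64)
    x0, y0 = kpts.min(axis=0)
    x1, y1 = kpts.max(axis=0)
    w, h = x1 - x0, y1 - y0
    return np.array([x0 - pad * w, y0 - pad * h, x1 + pad * w, y1 + pad * h])

def crop_and_resize(image, box, size):
    """Nearest-neighbor crop of `box`, resampled to size x size pixels."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    # Sample a regular grid inside the box and clamp it to the image bounds.
    xs = np.clip(np.round(np.linspace(x0, x1, size)).astype(int), 0, w - 1)
    ys = np.clip(np.round(np.linspace(y0, y1, size)).astype(int), 0, h - 1)
    return image[ys[:, None], xs[None, :]]

def build_crop_batch(image, keypoints, groups, size=64):
    """Derive every box from the same keypoints, then stack all crops.

    Because no box depends on a previous decoding step, the crops can be
    stacked into one array and encoded in a single batched backbone call.
    """
    boxes = [bbox_from_keypoints(keypoints[idx]) for idx in groups.values()]
    crops = [crop_and_resize(image, box, size) for box in boxes]
    return np.stack(crops), np.stack(boxes)

# Hypothetical 8-keypoint layout: head, shoulders, two points per hand, feet.
image = np.zeros((480, 640, 3), dtype=np.uint8)
keypoints = np.array([
    [320, 100], [280, 180], [360, 180],
    [250, 260], [240, 280],   # left hand
    [390, 260], [400, 280],   # right hand
    [320, 420],
], dtype=np.float64)
groups = {
    "body": np.arange(8),
    "left_hand": np.array([3, 4]),
    "right_hand": np.array([5, 6]),
}

batch, boxes = build_crop_batch(image, keypoints, groups, size=64)
print(batch.shape)  # one body crop and two hand crops in a single batch
```

In this sketch the body and hand boxes are computed independently from the same detector output, which is the property that allows a single stacked batch in place of a body pass followed by separate hand passes.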