SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos

Mingqi Gao, Yunqi Miao, Jungong Han

TL;DR

SAM-Body4D is a training-free framework for temporally consistent 4D human mesh recovery from videos. It combines three components: identity-consistent masklets from a promptable video segmentation model, an Occlusion-Aware Refiner that recovers occluded regions, and a mask-guided HMR stage that uses the refined masks to produce stable per-frame meshes. The approach improves temporal stability and robustness on in-the-wild videos without retraining, and a padding-based parallel inference strategy keeps multi-human processing efficient. Overall, the method transfers pixel-level temporal continuity into coherent 4D reconstructions suitable for real-world video applications.

Abstract

Human Mesh Recovery (HMR) aims to reconstruct 3D human pose and shape from 2D observations and is fundamental to human-centric understanding in real-world scenarios. While recent image-based HMR methods such as SAM 3D Body achieve strong robustness on in-the-wild images, they rely on per-frame inference when applied to videos, leading to temporal inconsistency and degraded performance under occlusions. We address these issues without extra training by leveraging the inherent human continuity in videos. We propose SAM-Body4D, a training-free framework for temporally consistent and occlusion-robust HMR from videos. We first generate identity-consistent masklets using a promptable video segmentation model, then refine them with an Occlusion-Aware module to recover missing regions. The refined masklets guide SAM 3D Body to produce consistent full-body mesh trajectories, while a padding-based parallel strategy enables efficient multi-human inference. Experimental results demonstrate that SAM-Body4D achieves improved temporal stability and robustness in challenging in-the-wild videos, without any retraining. Our code and demo are available at: https://github.com/gaomingqi/sam-body4d.
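
The abstract names a padding-based parallel strategy for multi-human inference but does not spell out its mechanics. The following is a minimal sketch of one plausible reading, not the authors' implementation: each frame's variable-length list of per-person inputs is padded to a common number of slots, the whole video is flattened into a single batch for one model call, and padded outputs are discarded afterwards. The names `pad_to_fixed_slots`, `run_batched`, `pad_item`, and `batch_fn` are hypothetical and do not come from the paper or its repository.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # per-person model input (e.g. a masked crop); placeholder type
R = TypeVar("R")  # per-person model output (e.g. a mesh); placeholder type


def pad_to_fixed_slots(
    per_frame_items: Sequence[Sequence[T]], pad_item: T
) -> Tuple[List[List[T]], List[List[bool]]]:
    """Pad every frame's item list to the same length.

    Returns the padded lists and a parallel validity mask that is
    False wherever a padding item was inserted.
    """
    max_people = max((len(items) for items in per_frame_items), default=0)
    padded: List[List[T]] = []
    valid: List[List[bool]] = []
    for items in per_frame_items:
        n_pad = max_people - len(items)
        padded.append(list(items) + [pad_item] * n_pad)
        valid.append([True] * len(items) + [False] * n_pad)
    return padded, valid


def run_batched(
    per_frame_items: Sequence[Sequence[T]],
    pad_item: T,
    batch_fn: Callable[[List[T]], List[R]],
) -> List[List[R]]:
    """Call batch_fn once on all frames and drop outputs for padded slots."""
    padded, valid = pad_to_fixed_slots(per_frame_items, pad_item)
    slots = len(padded[0]) if padded else 0
    # Flatten frames x slots into one batch so the model runs a single time.
    flat = [item for frame in padded for item in frame]
    flat_out = batch_fn(flat)
    results: List[List[R]] = []
    for f, frame_valid in enumerate(valid):
        frame_out = flat_out[f * slots:(f + 1) * slots]
        results.append([out for out, ok in zip(frame_out, frame_valid) if ok])
    return results
```

Under this assumption the cost of padding is wasted computation on dummy slots, which is traded for a fixed batch shape and a single forward pass regardless of how many people appear in each frame.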

Paper Structure

This paper contains 12 sections, 6 equations, 4 figures, 1 algorithm.

Figures (4)

  • Figure 1: Illustration of temporally consistent Human Mesh Recovery (HMR) from videos. (a) Input video frames. (b) Identity-consistent human masks, where each person is highlighted with a unique and consistent colour across frames. (c) Vanilla image-to-video HMR baseline using SAM 3D Body with automatic human detection and per-frame inference. Only the meshes corresponding to the masks in (b) are visualised here; a mesh missing from a frame indicates that the corresponding person was not detected in that frame. (d) Our spatio-temporally consistent HMR, where the temporal continuity in masklets is directly propagated into the 4D human meshes. (e) Our full SAM-Body4D with occlusion-aware refinement. Across the 2nd–5th columns, SAM-Body4D recovers plausible and temporally stable reconstructions under occlusion. As these humans are heavily occluded, their complete meshes are visualised in the bottom-left corner for clearer observation.
  • Figure 2: Overall framework of the proposed SAM-Body4D. Given an input video with human prompts, SAM-Body4D runs three main modules in a training-free manner. The Masklet Generator derives identity-consistent temporal masklets from the video to provide spatio-temporal tracking cues. The Occlusion-Aware Masklet Refiner enriches these masklets by recovering invisible body regions and stabilizing temporal alignment. Finally, the Mask-Guided HMR module uses the refined masklets as spatial prompts to predict accurate and temporally coherent human meshes across the entire sequence.
  • Figure 3: Visualised comparisons between the vanilla image-to-video extension of SAM 3D Body and our SAM-Body4D. (a) Input video frames. (b) Identity-consistent human masks. (c) Vanilla per-frame HMR results using SAM 3D Body with automatic human detection, where missed detections lead to missing meshes. (d) Our SAM-Body4D maintains temporally continuous and identity-preserving mesh trajectories throughout the video by leveraging spatio-temporal masklet guidance.
  • Figure 4: Visualised comparisons between SAM-Body4D without and with the Occlusion-Aware Masklet Refiner. (a) Input video frames; (b) Temporally consistent human masks, where each person is highlighted with a unique and consistent colour across frames; (c) SAM-Body4D without the Occlusion-Aware Masklet Refiner; (d) SAM-Body4D with the Occlusion-Aware Masklet Refiner. Across the 2nd–6th columns, SAM-Body4D produces more robust reconstructions under occlusion (e.g., the blue-rendered person in the 2nd column, the purple-rendered people in the 3rd and 4th columns, and the green-rendered people in the 5th and 6th columns). Since these subjects are heavily occluded, their meshes without occlusion are shown at the bottom-left/bottom-right for clearer observation.
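
The data flow through the three modules named in the Figure 2 caption can be sketched as below. This is an illustrative skeleton under stated assumptions, not the released code: the three stages are passed in as callables (`generate_masklets`, `refine_masklets`, `mask_guided_hmr`, all hypothetical names), masklets are assumed to be stored as one mask per frame keyed by a person id, and a `None` mask is assumed to mean the person has no mask in that frame.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence

Frame = Any  # placeholder for an image
Mask = Any   # placeholder for a binary mask
Mesh = Any   # placeholder for a body mesh

# Assumed layout: person id -> one mask (or None) per frame.
Masklets = Dict[int, List[Optional[Mask]]]


def recover_mesh_trajectories(
    frames: Sequence[Frame],
    prompts: Any,
    generate_masklets: Callable[[Sequence[Frame], Any], Masklets],
    refine_masklets: Callable[[Sequence[Frame], Masklets], Masklets],
    mask_guided_hmr: Callable[[Frame, Mask], Mesh],
) -> Dict[int, List[Optional[Mesh]]]:
    """Chain the three stages and return one mesh trajectory per person id."""
    # Stage 1: identity-consistent masklets from the prompted video.
    masklets = generate_masklets(frames, prompts)
    # Stage 2: occlusion-aware refinement of those masklets.
    refined = refine_masklets(frames, masklets)
    # Stage 3: per-frame HMR, prompted by the refined mask of each person.
    trajectories: Dict[int, List[Optional[Mesh]]] = {}
    for person_id, masks in refined.items():
        if len(masks) != len(frames):
            raise ValueError(f"masklet {person_id} does not cover every frame")
        trajectories[person_id] = [
            None if mask is None else mask_guided_hmr(frame, mask)
            for frame, mask in zip(frames, masks)
        ]
    return trajectories
```

Keying every stage by the same person id is what lets the identity consistency of the masklets carry over to the mesh trajectories in this sketch; the mesh predictor itself is still invoked once per frame.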