Human3R: Everyone Everywhere All at Once

Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, Gerard Pons-Moll

TL;DR

Human3R addresses online, world-grounded 4D reconstruction of multiple humans and scenes from monocular video with a single feed-forward model. It fine-tunes a compact visual-prompt layer on top of the CUT3R 4D foundation model to jointly predict SMPL-X human meshes, dense scene geometry, and camera poses in real time, using head-token prompts and a human-prior encoder to read out multiple bodies. Trained on the synthetic BEDLAM dataset for one day on a single 48 GB GPU, it runs at 15 FPS with an 8 GB memory footprint and achieves state-of-the-art or competitive results on local mesh recovery, global motion estimation, video depth estimation, and camera pose estimation. With few external dependencies, it offers a simple yet strong baseline for end-to-end 4D reconstruction, with potential applications in AR/VR, autonomous navigation, and human-robot interaction.

Abstract

We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction, in the world frame, from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies (e.g., human detection, depth estimation, and SLAM pre-processing), Human3R jointly recovers global multi-person SMPL-X bodies ("everyone"), dense 3D scene ("everywhere"), and camera trajectories in a single forward pass ("all-at-once"). Our method builds upon the 4D online reconstruction model CUT3R and uses parameter-efficient visual prompt tuning to preserve CUT3R's rich spatiotemporal priors as far as possible while enabling direct readout of multiple SMPL-X bodies. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. After being trained on the relatively small-scale synthetic dataset BEDLAM for just one day on one GPU, it achieves superior performance with remarkable efficiency: it reconstructs multiple humans in a one-shot manner, along with 3D scenes, in one stage, at real-time speed (15 FPS) with a low memory footprint (8 GB). Extensive experiments demonstrate that Human3R delivers state-of-the-art or competitive performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation, with a single unified model. We hope that Human3R will serve as a simple yet strong baseline that can be easily extended for downstream applications. Code is available at https://fanegg.github.io/Human3R

Paper Structure

This paper contains 22 sections, 3 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Given a stream of RGB images as input, Human3R enables human-scene reconstruction in an online, continuous manner, estimating global multi-person meshes, camera parameters, and dense scene geometry with each incoming frame in real time.
  • Figure 2: Human behaviors (e.g., grocery shopping) become clearer when viewed within their surrounding environment.
  • Figure 3: Multi-stage vs. One-stage.
  • Figure 4: Method Overview. Human3R enables online human-scene reconstruction from video streams. Each frame is encoded into image tokens, with detection performed at the patch level. Each detected head token is concatenated with a human prior token from the Multi-HMR ViT-DINO feature and projected into a human prompt. The human prompts serve as discriminative human-ID queries for the decoder: they self-attend with image tokens to aggregate spatial whole-body information and cross-attend with the scene state to retrieve temporally consistent human tokens within the 3D scene context. Only human-related layers are fine-tuned; all other parameters remain frozen and are initialized from CUT3R.
  • Figure 5: Detection and Segmentation.
  • ...and 11 more figures
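
The prompt construction described in the Figure 4 caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names (`build_human_prompts`, `attend`), the token dimensions, the 0.5 detection threshold, the random weights, and the use of plain single-head attention without learned query/key/value projections are all hypothetical choices made for readability.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys_values):
    """Single-head scaled dot-product attention (no learned projections)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values


def build_human_prompts(image_tokens, prior_tokens, head_scores, w_proj,
                        threshold=0.5):
    """Pick patches flagged as heads, concatenate each head token with its
    human prior token, and project the result into a human prompt."""
    head_idx = np.flatnonzero(head_scores > threshold)
    fused = np.concatenate(
        [image_tokens[head_idx], prior_tokens[head_idx]], axis=-1)
    return head_idx, fused @ w_proj


# Toy sizes; all values are assumptions made for illustration only.
rng = np.random.default_rng(0)
num_patches, d_model, d_prior, num_state = 16, 8, 4, 6
image_tokens = rng.normal(size=(num_patches, d_model))   # per-patch tokens
prior_tokens = rng.normal(size=(num_patches, d_prior))   # stand-in for prior features
scene_state = rng.normal(size=(num_state, d_model))      # stand-in for scene state
w_proj = rng.normal(size=(d_model + d_prior, d_model))   # stand-in projection

# Pretend the patch-level detector fired on two patches.
head_scores = np.zeros(num_patches)
head_scores[[3, 10]] = 0.9

head_idx, prompts = build_human_prompts(
    image_tokens, prior_tokens, head_scores, w_proj)

# Prompts attend over prompts + image tokens (spatial aggregation) ...
tokens = np.concatenate([prompts, image_tokens], axis=0)
prompts_spatial = attend(prompts, tokens)
# ... then cross-attend with the scene state (temporal/scene context).
human_tokens = attend(prompts_spatial, scene_state)
```

Each detected head yields one prompt, so the number of rows in `human_tokens` equals the number of detected people; in the actual model these tokens would then be decoded into SMPL-X parameters by layers that are not shown here.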