Joint Optimization for 4D Human-Scene Reconstruction in the Wild

Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou

TL;DR

This work tackles 4D human-scene reconstruction from monocular web videos by jointly optimizing global human motion and the surrounding scene, with explicit human-scene contact constraints tying the two together. It introduces JOSH, an optimization-based framework that initializes from dense scene reconstruction and SMPL-based human meshes, then jointly refines the scene geometry, the camera poses, and the human motion. To enable real-time inference, it also presents JOSH3R, a lightweight end-to-end model trained on pseudo-labels that JOSH produces from web videos. Empirically, JOSH achieves state-of-the-art results on global motion estimation and dense scene reconstruction across EMDB, SLOPER4D, and RICH, while JOSH3R offers competitive accuracy at substantially higher speed and generalizes well to web data. The approach supports scalable analysis of in-the-wild human-scene interactions grounded in contact constraints, and it offers a practical path toward large-scale, video-driven datasets.

Abstract

Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, prior methods struggle to reconstruct the natural and diverse human motion and scene context found in web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses techniques from both dense scene reconstruction and human mesh recovery as initialization, and then leverages human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experimental results show that JOSH achieves better results on both global human motion estimation and dense scene reconstruction through joint optimization of scene geometry and human motion. We further design a more efficient model, JOSH3R, and train it directly with pseudo-labels from web videos. JOSH3R outperforms other optimization-free methods even though it is trained only with labels predicted by JOSH, further demonstrating its accuracy and generalization ability.
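
The role of the contact constraint can be illustrated with a toy example. The following is a minimal sketch, not JOSH's implementation: it assumes each human contact vertex has already been matched to a scene point, models the unknown global scale of the scene as a single scalar, and adds a hypothetical prior weight that keeps the human translation near its initial estimate. The names `contact_loss`, `joint_optimize`, and `prior_weight` are illustrative and do not come from the paper.

```python
import numpy as np


def contact_loss(scale, trans, contact_verts, scene_pts, prior_weight=0.1):
    """Mean squared gap between translated contact vertices and scaled scene points,
    plus a quadratic prior that keeps the translation near its initial value (zero)."""
    residual = contact_verts + trans - scale * scene_pts
    return float(np.mean(np.sum(residual ** 2, axis=1)) + prior_weight * np.sum(trans ** 2))


def joint_optimize(contact_verts, scene_pts, scale=1.0, steps=2000, lr=0.04, prior_weight=0.1):
    """Plain gradient descent over the scene scale and the human translation together."""
    trans = np.zeros(3)
    for _ in range(steps):
        residual = contact_verts + trans - scale * scene_pts
        # Analytic gradients of contact_loss with respect to scale and translation.
        grad_scale = -2.0 * np.mean(np.sum(residual * scene_pts, axis=1))
        grad_trans = 2.0 * residual.mean(axis=0) + 2.0 * prior_weight * trans
        scale -= lr * grad_scale
        trans -= lr * grad_trans
    return scale, trans


# Toy data (assumed): up-to-scale scene points matched to metric contact vertices.
scene_pts = np.array([[-0.5, 1.0, 3.0], [0.5, 1.0, 3.0], [0.0, 1.0, 4.0], [0.0, 1.0, 2.0]])
true_scale = 1.6
contact_verts = true_scale * scene_pts

initial_loss = contact_loss(1.0, np.zeros(3), contact_verts, scene_pts)
scale, trans = joint_optimize(contact_verts, scene_pts)
final_loss = contact_loss(scale, trans, contact_verts, scene_pts)
print(scale, trans, initial_loss, final_loss)
```

In this toy setup the matched contacts are exactly consistent with a scale of 1.6, so gradient descent moves the scale from 1.0 toward that value and the translation ends near zero. A real system would combine such a term with the other losses and would have to find the contact correspondences itself.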

Paper Structure (29 sections, 5 equations, 6 figures, 6 tables)

This paper contains 29 sections, 5 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Overview of JOSH. Given an input video: (a) JOSH first initializes the local scene reconstruction and local SMPL meshes from pre-trained models. (b) JOSH then jointly optimizes the dense scene point cloud and the global human motion, using human-scene contact losses inferred from the contact labels together with the other losses, to predict the 4D human-scene reconstruction.
  • Figure 2: Architecture of JOSH3R. Given an input image pair, JOSH3R uses the frozen MASt3R encoder and decoders to obtain the feature map of each image. The human trajectory head then processes the image feature in the mask region with a mask encoder to predict the relative human transformation between the two images.
  • Figure 3: Qualitative 4D human-scene reconstruction results. (a) Qualitative results on the RICH dataset (Huang et al., 2022). We compare the ground-truth reconstruction with JOSH, which uses joint optimization, and the MonST3R$^\star$ baseline, which does not. JOSH has better reconstruction quality and consistency in both global human motion and dense scene reconstruction. (b) Qualitative results on web videos. JOSH can reconstruct the motion of multiple people and their surrounding environment in the wild.
  • Figure 4: Qualitative comparisons for large-scale dense scene reconstruction. We use 'seq003' from the SLOPER4D dataset, which has 6495 frames and a looping camera trajectory. We visualize the reconstructed dense scene point cloud and the camera trajectory at metric scale. JOSH is the most accurate of the compared methods, while MASt3R$^\star$ recovers the wrong global scale and MonST3R$^\star$ fails to produce a consistent reconstruction.
  • Figure 5: Web videos annotated by JOSH. The first row shows the initial frame of each sequence, with pedestrians walking in urban scenes. The second row shows JOSH's pseudo-labels of the global human motion, visualized by projecting the future motion onto the initial frame.
  • ...and 1 more figure
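
The projection of future motion onto the initial frame, as described in the Figure 5 caption, can be sketched with a standard pinhole camera model. This is an illustrative sketch under assumptions: the function name `project_to_frame`, the world-to-camera convention (`x_cam = R @ x_world + t`), and the intrinsics values are hypothetical and not taken from the paper.

```python
import numpy as np


def project_to_frame(points_world, R_wc, t_wc, K):
    """Project 3D world points into one camera frame with a pinhole model.

    Returns pixel coordinates (NaN for points behind the camera) and a
    boolean mask marking the points that lie in front of the camera.
    """
    pts_cam = points_world @ R_wc.T + t_wc      # world -> camera coordinates
    in_front = pts_cam[:, 2] > 1e-6             # only these can be projected
    uv = np.full((len(points_world), 2), np.nan)
    uvw = pts_cam[in_front] @ K.T               # apply the intrinsics
    uv[in_front] = uvw[:, :2] / uvw[:, 2:3]     # perspective division
    return uv, in_front


# Assumed intrinsics and an identity pose for the initial frame.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R_wc = np.eye(3)
t_wc = np.zeros(3)

# Hypothetical future root positions of one person, in world coordinates.
trajectory = np.array([[0.0, 0.0, 2.0],
                       [1.0, 0.0, 5.0],
                       [0.0, 0.0, -1.0]])   # the last point is behind the camera

uv, visible = project_to_frame(trajectory, R_wc, t_wc, K)
print(uv, visible)
```

Points that fall behind the initial camera are masked out rather than projected, since the perspective division is not meaningful for them.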