Reconstructing People, Places, and Cameras

Lea Müller; Hongsuk Choi; Anthony Zhang; Brent Yi; Jitendra Malik; Angjoo Kanazawa

TL;DR

HSfM reconstructs humans, scene geometry, and camera poses in a metric world coordinate system from sparse, uncalibrated multi-view images. It jointly optimizes SMPL-X human meshes, scene pointmaps, and camera parameters using a global alignment loss and bundle adjustment driven by 2D keypoints, producing metric-scale reconstructions in which people are grounded in the environment. Human size priors constrain the scale, and data-driven initializations stabilize the optimization. Human localization error drops from 3.51 m to 1.04 m on EgoHumans and from 2.9 m to 0.56 m on EgoExo4D, and camera pose accuracy improves as well (for example, RRA@15 on EgoHumans increases by 20.3%). The result is more accurate scene reconstruction and camera estimation for real-world scenes with interacting people, without explicit scene-contact constraints.
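
To make the role of a human size prior concrete, the sketch below shows one simple way a body of known metric size can fix the scale of an up-to-scale reconstruction: compare bone lengths of the same joints in the two frames and take the median ratio. This is an illustrative assumption, not HSfM's procedure, which optimizes scale jointly with the meshes, pointmaps, and cameras. The function names, the three-joint "leg", and the bone list are all hypothetical.

```python
import math
from statistics import median


def bone_lengths(joints, bones):
    """Euclidean length of each bone, given 3D joints and (parent, child) index pairs."""
    return [math.dist(joints[a], joints[b]) for a, b in bones]


def estimate_metric_scale(joints_scene, joints_metric, bones):
    """Median ratio of metric to scene bone lengths.

    joints_scene: joints expressed in the up-to-scale scene frame.
    joints_metric: the same joints from a metric body model (in metres).
    The median keeps a few badly placed joints from dominating the estimate.
    """
    scene = bone_lengths(joints_scene, bones)
    metric = bone_lengths(joints_metric, bones)
    # Skip degenerate (near zero-length) scene bones to avoid dividing by zero.
    ratios = [m / s for m, s in zip(metric, scene) if s > 1e-9]
    if not ratios:
        raise ValueError("no non-degenerate bones to estimate scale from")
    return median(ratios)


# Hypothetical toy data: hip, knee, ankle of one leg, in metres.
joints_metric = [(0.0, 0.9, 0.0), (0.0, 0.5, 0.0), (0.0, 0.1, 0.0)]
# The same joints in a scene frame whose unit is four times too small.
joints_scene = [(x * 0.25, y * 0.25, z * 0.25) for x, y, z in joints_metric]
bones = [(0, 1), (1, 2)]  # hip-knee, knee-ankle

scale = estimate_metric_scale(joints_scene, joints_metric, bones)
print(f"multiply scene coordinates by {scale:.3f} to obtain metres")
```

Multiplying the scene points and camera translations by the recovered factor would bring them into the same metric frame as the body model; a joint optimization can then refine that initial estimate.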

Abstract

We present "Humans and Structure from Motion" (HSfM), a method for jointly reconstructing multiple human meshes, scene point clouds, and camera parameters in a metric world coordinate system from a sparse set of uncalibrated multi-view images featuring people. Our approach combines data-driven scene reconstruction with the traditional Structure-from-Motion (SfM) framework to achieve more accurate scene reconstruction and camera estimation, while simultaneously recovering human meshes. In contrast to existing scene reconstruction and SfM methods that lack metric scale information, our method estimates approximate metric scale by leveraging a human statistical model. Furthermore, it reconstructs multiple human meshes within the same world coordinate system alongside the scene point cloud, effectively capturing spatial relationships among individuals and their positions in the environment. We initialize the reconstruction of humans, scenes, and cameras using robust foundational models and jointly optimize these elements. This joint optimization synergistically improves the accuracy of each component. We compare our method to existing approaches on two challenging benchmarks, EgoHumans and EgoExo4D, demonstrating significant improvements in human localization accuracy within the world coordinate frame (reducing error from 3.51m to 1.04m in EgoHumans and from 2.9m to 0.56m in EgoExo4D). Notably, our results show that incorporating human data into the SfM pipeline improves camera pose estimation (e.g., increasing RRA@15 by 20.3% on EgoHumans). Additionally, qualitative results show that our approach improves overall scene reconstruction quality. Our code is available at: https://github.com/hongsukchoi/HSfM_RELEASE
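
For readers unfamiliar with the RRA@15 figure quoted above, the following sketch computes it under a commonly used definition: the fraction of camera pairs whose relative rotation differs from the ground-truth relative rotation by less than 15 degrees. The evaluation protocol used in the paper may differ in detail. The helper names and the toy cameras are hypothetical.

```python
import math
from itertools import combinations


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    """Transpose of a 3x3 matrix (the inverse, for a rotation)."""
    return [list(row) for row in zip(*a)]


def rotation_angle_deg(r):
    """Rotation angle of a rotation matrix, from its trace."""
    cos_angle = (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def rra(pred, gt, threshold_deg=15.0):
    """Fraction of camera pairs with relative rotation error below the threshold."""
    pairs = list(combinations(range(len(pred)), 2))
    if not pairs:
        raise ValueError("need at least two cameras")
    hits = 0
    for i, j in pairs:
        rel_pred = matmul(pred[j], transpose(pred[i]))
        rel_gt = matmul(gt[j], transpose(gt[i]))
        error = rotation_angle_deg(matmul(rel_pred, transpose(rel_gt)))
        hits += error < threshold_deg
    return hits / len(pairs)


def rot_z(deg):
    """Rotation about the z axis by the given angle in degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Hypothetical toy cameras: the second is off by 5 degrees, the third by 40.
gt_rotations = [rot_z(0), rot_z(30), rot_z(60)]
pred_rotations = [rot_z(0), rot_z(35), rot_z(100)]

score = rra(pred_rotations, gt_rotations)
print(f"RRA@15 = {score:.3f}")
```

Because the metric compares relative rotations between pairs of cameras, it does not depend on how the predicted and ground-truth world frames are aligned.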
