Reconstructing People, Places, and Cameras

Lea Müller, Hongsuk Choi, Anthony Zhang, Brent Yi, Jitendra Malik, Angjoo Kanazawa

TL;DR

HSfM addresses the challenge of reconstructing humans, scene geometry, and camera poses in a metric world from sparse uncalibrated multi-view images. By jointly optimizing SMPL-X human meshes, scene pointmaps, and camera parameters with a global alignment loss and 2D keypoint-driven bundle adjustment, it yields metric-scale reconstructions where people are grounded in the environment. The method leverages human size priors to constrain scale and uses data-driven initializations to stabilize optimization, achieving substantial improvements in human localization and camera pose accuracy on EgoHumans and EgoExo4D. This integrated approach advances practical 3D understanding of real-world scenes with interacting people, enabling more accurate scene reconstructions and reliable camera calibration without explicit scene-contact constraints.

Abstract

We present "Humans and Structure from Motion" (HSfM), a method for jointly reconstructing multiple human meshes, scene point clouds, and camera parameters in a metric world coordinate system from a sparse set of uncalibrated multi-view images featuring people. Our approach combines data-driven scene reconstruction with the traditional Structure-from-Motion (SfM) framework to achieve more accurate scene reconstruction and camera estimation, while simultaneously recovering human meshes. In contrast to existing scene reconstruction and SfM methods that lack metric scale information, our method estimates approximate metric scale by leveraging a human statistical model. Furthermore, it reconstructs multiple human meshes within the same world coordinate system alongside the scene point cloud, effectively capturing spatial relationships among individuals and their positions in the environment. We initialize the reconstruction of humans, scenes, and cameras using robust foundational models and jointly optimize these elements. This joint optimization synergistically improves the accuracy of each component. We compare our method to existing approaches on two challenging benchmarks, EgoHumans and EgoExo4D, demonstrating significant improvements in human localization accuracy within the world coordinate frame (reducing error from 3.51m to 1.04m in EgoHumans and from 2.9m to 0.56m in EgoExo4D). Notably, our results show that incorporating human data into the SfM pipeline improves camera pose estimation (e.g., increasing RRA@15 by 20.3% on EgoHumans). Additionally, qualitative results show that our approach improves overall scene reconstruction quality. Our code is available at: https://github.com/hongsukchoi/HSfM_RELEASE
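
The abstract reports camera accuracy as RRA@15. As commonly defined, relative rotation accuracy at 15 degrees is the fraction of camera pairs whose relative rotation differs from the ground truth by less than 15 degrees. The following is a minimal sketch of that metric, not the authors' evaluation code. The function names (`rot_z`, `rotation_angle_deg`, `rra_at`) are hypothetical, and the rotations are assumed to be world-to-camera matrices.

```python
import itertools
import numpy as np

def rot_z(deg):
    """Helper for the example: rotation about the z-axis by `deg` degrees."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rotation_angle_deg(R):
    """Geodesic angle (in degrees) of a 3x3 rotation matrix."""
    cos = (np.trace(R) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def rra_at(R_pred, R_gt, threshold_deg=15.0):
    """Fraction of camera pairs with relative-rotation error below the threshold.

    R_pred, R_gt: arrays of shape (N, 3, 3).
    Assumption: world-to-camera rotations, so R_j @ R_i.T is the rotation
    from camera i to camera j and does not depend on the world frame.
    """
    hits, total = 0, 0
    for i, j in itertools.combinations(range(len(R_gt)), 2):
        rel_pred = R_pred[j] @ R_pred[i].T
        rel_gt = R_gt[j] @ R_gt[i].T
        err = rotation_angle_deg(rel_pred @ rel_gt.T)
        hits += int(err < threshold_deg)
        total += 1
    return hits / total

# Three cameras; the second predicted camera is off by 10 degrees.
R_gt = np.stack([rot_z(0), rot_z(30), rot_z(60)])
R_pred = np.stack([rot_z(0), rot_z(40), rot_z(60)])
print(rra_at(R_pred, R_gt, threshold_deg=15.0))
```

Because only relative rotations are compared, the metric is unaffected by the choice of world frame, which is why it suits uncalibrated setups where predicted and ground-truth world frames differ.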

Paper Structure

This paper contains 17 sections, 10 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Humans and Structure from Motion (HSfM). We propose a method for the joint reconstruction of humans, scene point clouds, and cameras from an uncalibrated, sparse set of images depicting people. By explicitly incorporating humans into the traditional Structure from Motion (SfM) framework through 2D human keypoint correspondences and leveraging robust initialization from an off-the-shelf model for scene and camera reconstruction, our approach demonstrates that integrating these three elements—people, scenes, and cameras—synergistically improves the reconstruction accuracy of each component. Unlike prior work in SfM and human pose estimation, our method reconstructs metric-scale scene point clouds and camera parameters, informed by human mesh predictions, while situating human meshes in coherent world coordinates consistent with the surrounding environment without any explicit contact constraints.
  • Figure 2: Pipeline of Humans and Structure from Motion. Our method processes synchronized images from an uncalibrated multi-view camera setup with known person correspondences across views. We use pretrained networks to estimate 2D human keypoints per image [xu2022vitpose], 3D human meshes [goel2023humans], and scene point clouds in a pointmap representation together with camera intrinsic and extrinsic parameters [wang2024dust3r]. We first initialize these estimates in a common world coordinate system by recovering the scene scale $\alpha$ and the human locations $\gamma$ (global translations in world coordinates), as described in the subsection on scale initialization. We then jointly optimize humans, the scene, and cameras using bundle adjustment based on 2D human keypoints, 3D human meshes, and a global alignment loss that merges per-view pointmaps into the same world space.
  • Figure 3: Qualitative results from HSfM. We show our optimized results on sequences from EgoHumans (top) and EgoExo4D (bottom). Note how, in the initial state (left), people float in the air (a), the scene and human scales are not aligned (e), and the scene appears noisy (c). Our method resolves these problems by grounding people in the scene (b), recovering plausible metric scale (f), and producing better camera estimates (d). We achieve this without scene contact constraints, which often require assumptions about the environment (such as flat terrain) or about motion (such as humans always being in contact with the ground, i.e., no jumping). For more qualitative results, including a demo on images taken in the wild with a minimal capturing setup, please see our supplementary material.
  • Figure S.1: Qualitative results in the wild. We show reconstructions on in-the-wild images taken with two smartphones (a), demonstrating the reconstruction of humans and scenes. Unlike previous works [zou2020reducing, ye2023decoupling], which adopt human-scene contact priors that hinder generalization to scenarios without ground foot contact, HSfM recovers accurate world locations of the human meshes that are coherent with the static scene structure. The use of humans in our framework (c) not only serves as a reliable initialization for 3D structure in the SfM formulation but also provides more faithful and complete information about people in the world, which a noisy human point cloud (b) cannot offer. For visualization purposes, the human point cloud is removed using SAM2 [ravi2024sam2].
  • Figure S.2: Qualitative results in the wild. We show reconstructions of humans and the scene from in-the-wild images taken with two cell phones. Our method places people in the world and reconstructs accurate human-scene contact, e.g., between the person's right foot and the box.
  • ...and 6 more figures
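
The Figure 2 caption describes bundle adjustment driven by 2D human keypoints. The core quantity in such a scheme is a reprojection residual: 3D body joints in world coordinates are projected into each camera and compared with the detected 2D keypoints. The sketch below illustrates that residual for one person in one view under a standard pinhole model. It is an illustration under stated assumptions, not the authors' implementation: the function names (`project`, `keypoint_reprojection_loss`), the confidence weighting, and the squared-error form are all hypothetical choices.

```python
import numpy as np

def project(points_world, K, R, t):
    """Pinhole projection of (J, 3) world points to (J, 2) pixel coordinates.

    Assumption: R (3x3) and t (3,) map world coordinates to camera
    coordinates, and K is a 3x3 intrinsic matrix.
    """
    cam = points_world @ R.T + t      # world frame -> camera frame
    uv = cam @ K.T                    # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]     # perspective divide

def keypoint_reprojection_loss(joints_world, keypoints_2d, confidence, K, R, t):
    """Confidence-weighted mean squared pixel error for one person in one view."""
    residual = project(joints_world, K, R, t) - keypoints_2d
    per_joint = np.sum(residual ** 2, axis=1)
    return float(np.sum(confidence * per_joint) / np.sum(confidence))

# Toy example: two joints 2 m in front of a camera at the world origin.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
joints = np.array([[0.0, 0.0, 2.0], [0.2, 0.0, 2.0]])
detections = project(joints, K, R, t) + np.array([3.0, 4.0])  # 5-pixel offset
confidence = np.array([1.0, 0.5])
print(keypoint_reprojection_loss(joints, detections, confidence, K, R, t))
```

In a full pipeline this term would be summed over all people and views and minimized jointly over the human, camera, and scale parameters, alongside the global alignment loss mentioned in the caption.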