Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass

Sangmin Kim; Minhyuk Hwang; Geonho Cha; Dongyoon Wee; Jaesik Park

Abstract

Recent advances in 3D foundation models have led to growing interest in reconstructing humans and their surrounding environments. However, most existing approaches focus on monocular inputs, and extending them to multi-view settings requires additional modules or preprocessed data, which adds overhead. To address this, we present CHROMM, a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person multi-view videos without relying on external modules or preprocessing. We integrate strong geometric and human priors from Pi3X and Multi-HMR into a single trainable neural network architecture, and introduce a scale adjustment module to resolve the scale discrepancy between humans and the scene. We also introduce a multi-view fusion strategy that aggregates per-view estimates into a single representation at test time. Finally, we propose a geometry-based multi-person association method, which is more robust than appearance-based approaches. Experiments on EMDB, RICH, EgoHumans, and EgoExo4D show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Project page: https://nstar1125.github.io/chromm.

Paper Structure (21 sections, 7 equations, 10 figures, 12 tables, 1 algorithm)

Introduction
Related Work
3D Scene Reconstruction
Human Mesh Recovery
Human-Scene Reconstruction
Method
Model Architecture
Multi-View Fusion
Multi-Person Association
Training CHROMM
Experiments
Experimental Setup
Global Human Motion Estimation
Multi-View Human Pose Estimation
Qualitative Result
...and 6 more sections

Figures (10)

Figure 1: Given multi-person, multi-view videos, our proposed approach, CHROMM, reconstructs cameras, scene point cloud, and human meshes in a single pass.
Figure 2: Overview of our pipeline. Each frame is encoded by the Pi3 encoder and the Multi-HMR encoder. The Pi3 features are decoded to reconstruct the scene. Head tokens detected from the Multi-HMR features are fused with the corresponding Pi3 decoder tokens to predict SMPL parameters. At test time, we associate persons across views and fuse them into a global representation, followed by a scale adjustment module to align humans and the scene.
Figure 3: (Left) Pelvis location is predicted in a coarse-to-fine manner. (Right) Scale is adjusted using the head–pelvis length ratio between the image and projected SMPL.
Figure 4: Qualitative results on EgoBody, EgoHumans, EgoExo4D, and EMDB datasets.
Figure 5: Effect of scale adjustment.
...and 5 more figures
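
The scale cue named in the Figure 3 (right) caption, the head–pelvis length ratio between the image and the projected SMPL, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the pinhole intrinsics `K`, and the assumption that 2D head and pelvis locations are already available are all hypothetical, and how CHROMM applies the resulting ratio is not stated in this summary.

```python
import numpy as np


def project(points_cam, K):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    uvw = np.asarray(points_cam, dtype=float) @ np.asarray(K, dtype=float).T
    return uvw[:, :2] / uvw[:, 2:3]


def head_pelvis_scale_ratio(head_px, pelvis_px, smpl_head_cam, smpl_pelvis_cam, K, eps=1e-8):
    """Ratio of the observed head-pelvis length in the image to the
    head-pelvis length of the projected SMPL joints (hypothetical helper).

    A ratio above 1 means the projected SMPL body appears smaller in the
    image than the observed person; below 1 means it appears larger.
    """
    # Observed 2D length in pixels.
    image_len = np.linalg.norm(np.asarray(head_px, float) - np.asarray(pelvis_px, float))
    # 2D length of the same segment after projecting the SMPL joints.
    proj = project(np.stack([smpl_head_cam, smpl_pelvis_cam]), K)
    smpl_len = np.linalg.norm(proj[0] - proj[1])
    return image_len / max(smpl_len, eps)


# Illustrative values only (not from the paper).
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])
ratio = head_pelvis_scale_ratio(
    head_px=(500.0, 200.0),
    pelvis_px=(500.0, 500.0),
    smpl_head_cam=(0.0, -0.6, 4.0),
    smpl_pelvis_cam=(0.0, 0.0, 4.0),
    K=K,
)
```

In this toy setup the projected SMPL segment spans half as many pixels as the observed one, so the ratio signals that the human and the image evidence disagree in scale by a factor of two.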
