Simultaneously Recovering Multi-Person Meshes and Multi-View Cameras with Human Semantics

Buzhen Huang, Jingyi Ju, Yuan Shu, Yangang Wang

TL;DR

This paper tackles the simultaneous recovery of multiple human meshes and camera parameters from uncalibrated multi-view video. It introduces a calibration-free pipeline that (i) initializes camera intrinsics and extrinsics from upright human cues, (ii) builds cross-view associations via pose-geometry consistency, (iii) employs a compact VAE-based latent motion prior, built on a bidirectional GRU with a local linear constraint, to stabilize optimization, and (iv) jointly optimizes human motions and camera parameters under an objective that combines a data term, the motion prior, and a collision penalty. The approach recovers camera parameters and coherent multi-person motion in a single reconstruction step, without calibration tools or background features, and is reported to outperform calibrated baselines in many settings. Key contributions are the intrinsic and extrinsic initialization from human cues, a pose-geometry association that links detections across views without manual identity labeling, and a motion-prior-driven optimization that handles variable-length sequences and occlusions. The work is relevant to sports broadcasting, VR, and game development, where rapid, calibration-free multi-person capture is valuable.
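The joint optimization in step (iv) can be read as a weighted sum of the three kinds of terms it names. The form below is a generic sketch only: the symbols (motion variables $\Theta$, camera parameters $\mathcal{C}$) and the weights $\lambda$ are illustrative assumptions, not the paper's notation or its exact energy.

```latex
\min_{\Theta,\,\mathcal{C}} \; E(\Theta, \mathcal{C})
  = E_{\mathrm{data}}(\Theta, \mathcal{C})
  + \lambda_{\mathrm{prior}}\, E_{\mathrm{prior}}(\Theta)
  + \lambda_{\mathrm{col}}\, E_{\mathrm{col}}(\Theta)
```

Here the data term ties the 3D estimate to the 2D detections in every view, the prior term keeps the motion plausible, and the collision term penalizes interpenetration between people.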

Abstract

Dynamic multi-person mesh recovery has broad applications in sports broadcasting, virtual reality, and video games. However, current multi-view frameworks rely on a time-consuming camera calibration procedure. In this work, we focus on multi-person motion capture with uncalibrated cameras, which mainly faces two challenges: one is that inter-person interactions and occlusions introduce inherent ambiguities for both camera calibration and motion capture; the other is the lack of dense correspondences that can be used to constrain sparse camera geometries in a dynamic multi-person scene. Our key idea is to incorporate motion prior knowledge to simultaneously estimate camera parameters and human meshes from noisy human semantics. We first utilize human information from 2D images to initialize intrinsic and extrinsic parameters. Thus, the approach does not rely on any other calibration tools or background features. Then, a pose-geometry consistency is introduced to associate the detected humans from different views. Finally, a latent motion prior is proposed to refine the camera parameters and human motions. Experimental results show that accurate camera parameters and human motions can be obtained through a one-step reconstruction. The code is publicly available at https://github.com/boycehbz/DMMR.
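Estimating cameras and human motion together from 2D detections usually comes down to measuring how well the current 3D estimate reprojects into every view. The block below is a minimal sketch of such a measure, assuming a standard pinhole camera and confidence-weighted squared residuals. The function names `project` and `reprojection_error` are hypothetical, and this is not the authors' implementation, which lives in the DMMR repository.

```python
import numpy as np

def project(points_3d, K, R, t):
    """Project (N, 3) world points with a pinhole camera; returns (N, 2) pixels."""
    cam = points_3d @ R.T + t        # world -> camera coordinates
    uv = cam @ K.T                   # apply the intrinsic matrix
    return uv[:, :2] / uv[:, 2:3]    # perspective divide

def reprojection_error(points_3d, keypoints_2d, conf, cameras):
    """Confidence-weighted squared reprojection error, summed over all views.

    points_3d    : (N, 3) estimated 3D joints (assumed shared by all views)
    keypoints_2d : list of (N, 2) detected 2D joints, one array per view
    conf         : list of (N,) detection confidences, one array per view
    cameras      : list of (K, R, t) tuples, one per view
    """
    total = 0.0
    for (K, R, t), kp, c in zip(cameras, keypoints_2d, conf):
        residual = project(points_3d, K, R, t) - kp
        total += float(np.sum(c[:, None] * residual ** 2))
    return total
```

In a joint optimization, both `points_3d` and the camera tuples would be free variables. Weighting by detection confidence is one common way to reduce the influence of occluded or noisy joints.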

Paper Structure

This paper contains 15 sections, 19 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Overview of our method. Since directly optimizing cameras and human motions from noisy detections (a) always leads to suboptimal solutions, we first initialize the cameras with human cues. Next, we introduce a pose-geometry consistent association (b) to establish cross-view and temporal correspondences for the detected human semantics. We further train a latent motion prior for the optimization to obtain accurate camera parameters and coherent human motions from the associated inputs (d).
  • Figure 2: We show the pipeline of our method and the relationships between different modules. With the aid of a motion prior, our method can simultaneously recover precise camera parameters and human meshes from detected human semantics.
  • Figure 3: The motion prior is a symmetrical encoder-decoder network, which compactly models human dynamics and kinematics. The prior can be trained on short clips and used to fit long sequences.
  • Figure 4: Qualitative comparison with multi-view methods on Campus (Row 1) and Panoptic (Row 2) datasets. Campus captures humans in a large scene (we zoom in for better visualization). DMMR cannot reconstruct humans in the distance, and MvPose [dong2021fast] also fails on these cases due to the mismatched 2D pose and the lack of prior knowledge.
  • Figure 5: We show more results on different datasets. Our method can estimate accurate cameras and motions in a one-step reconstruction.
  • ...and 6 more figures