MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, Jie Song

TL;DR

MultiPly tackles the problem of reconstructing multiple people in high-fidelity 3D from monocular in-the-wild videos. It introduces a layered neural scene representation with layer-wise differentiable volume rendering, a hybrid instance segmentation strategy that combines self-supervised 3D decomposition and progressive SAM prompts, and a confidence-guided alternating optimization to produce temporally coherent pose, shape, and appearance for all subjects. The approach demonstrates robust performance on challenging datasets (Hi4D and MMM), surpassing prior methods in reconstruction quality, novel view synthesis, segmentation accuracy, and pose estimation under occlusions and close interactions. The work advances practical monocular multi-person capture with potential applications in AR/VR, telepresence, and social activity replay, while acknowledging limitations in scalability and hand modeling for future improvement.

Abstract

We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos. Reconstructing multiple individuals moving and interacting naturally from monocular in-the-wild videos poses a challenging task. Addressing it necessitates precise pixel-level disentanglement of individuals without any prior knowledge about the subjects. Moreover, it requires recovering intricate and complete 3D human shapes from short video sequences, intensifying the level of difficulty. To tackle these challenges, we first define a layered neural representation for the entire scene, composited by individual human and background models. We learn the layered neural representation from videos via our layer-wise differentiable volume rendering. This learning process is further enhanced by our hybrid instance segmentation approach which combines the self-supervised 3D segmentation and the promptable 2D segmentation module, yielding reliable instance segmentation supervision even under close human interaction. A confidence-guided optimization formulation is introduced to optimize the human poses and shape/appearance alternately. We incorporate effective objectives to refine human poses via photometric information and impose physically plausible constraints on human dynamics, leading to temporally consistent 3D reconstructions with high fidelity. The evaluation of our method shows the superiority over prior art on publicly available datasets and in-the-wild videos.
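The layer-wise compositing idea in the abstract can be illustrated with a minimal sketch for a single camera ray. This is not the authors' implementation. The `Sample` record, the `render_ray` function, and the choice to pass densities in directly (rather than deriving them from an SDF) are assumptions made for illustration. Samples from all person layers are merged, sorted by depth, and alpha-composited front to back, and the remaining transmittance is assigned to the background.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    depth: float    # distance of the sample along the ray
    delta: float    # length of the ray interval the sample represents
    density: float  # non-negative volume density at the sample
    rgb: tuple      # radiance (r, g, b), each in [0, 1]
    layer: int      # index of the person layer that produced the sample


def render_ray(samples, background_rgb, num_layers):
    """Composite samples from all person layers along one ray, front to back.

    Hypothetical helper: names and signature are illustrative only.
    """
    color = [0.0, 0.0, 0.0]
    layer_opacity = [0.0] * num_layers  # accumulated opacity per person
    transmittance = 1.0                 # fraction of light not yet absorbed

    for s in sorted(samples, key=lambda s: s.depth):
        alpha = 1.0 - math.exp(-s.density * s.delta)
        weight = transmittance * alpha
        for c in range(3):
            color[c] += weight * s.rgb[c]
        layer_opacity[s.layer] += weight
        transmittance *= 1.0 - alpha

    # Light that survives every person layer is attributed to the background.
    for c in range(3):
        color[c] += transmittance * background_rgb[c]
    return color, layer_opacity, transmittance


# Two people on one ray: person 0 (red) stands in front of person 1 (blue).
samples = [
    Sample(depth=2.0, delta=0.5, density=50.0, rgb=(0.0, 0.0, 1.0), layer=1),
    Sample(depth=1.0, delta=0.5, density=50.0, rgb=(1.0, 0.0, 0.0), layer=0),
]
color, layer_opacity, transmittance = render_ray(
    samples, background_rgb=(0.0, 1.0, 0.0), num_layers=2
)
```

In this toy case the front person absorbs almost all of the light, so the pixel is essentially red and nearly all opacity is credited to layer 0. The per-layer opacities and the leftover transmittance always sum to one, which is what would let per-person opacity act as a soft instance mask under these assumptions.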

Paper Structure (25 sections, 14 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: We propose MultiPly, a novel framework to reconstruct multiple people in 3D from in-the-wild monocular videos. Our method can recover the complete 3D human with high-fidelity shape and appearance, even in scenarios involving occlusions and close interactions.
  • Figure 2: Method overview. Given an image and SMPL estimation, we sample human points along the camera ray based on the bounding boxes of SMPL bodies and the background points based on NeRF++. We warp sampled human points into canonical space via inverse warping and evaluate the person-specific implicit network to obtain the SDF and radiance values (Sec. \ref{sec:representation}). The layer-wise volume rendering is then applied to learn the implicit networks from images (Sec. \ref{sec:volume}). We build a closed-loop refinement for instance segmentation by dynamically updating prompts for SAM using evolving human models (Sec. \ref{sec:sam_prompt}). Finally, we formulate a confidence-guided optimization that only optimizes pose parameters for unreliable frames and jointly optimizes pose and implicit networks for reliable frames (Sec. \ref{sec:delay}).
  • Figure 3: Qualitative ablation studies. Our progressive prompting strategy provides robust instance segmentation supervision and eliminates the noise caused by dynamic environmental effects. The confidence-guided optimization further improves the reconstruction results and maintains complete human bodies.
  • Figure 4: Qualitative reconstruction comparison. We show both the overlaid and separated reconstruction results for each method. Red bounding boxes: the incomplete reconstruction of the occluded part. Orange bounding boxes: incorrect instance segmentation results caused by the surrounding visual complexities. Black bounding boxes: inaccurate spatial arrangement due to pose error.
  • Figure 5: Qualitative rendering comparison. Our method achieves more plausible renderings with sharp boundaries.
  • ...and 1 more figure
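The confidence-guided schedule described in the Figure 2 caption can be sketched as a per-frame choice of which parameter groups to update. This is a hedged illustration only: the function name `plan_frame_updates`, the scalar per-frame confidence, and the threshold value are hypothetical, and the source does not state how confidence is computed.

```python
def plan_frame_updates(confidences, threshold=0.5):
    """Return, per frame index, the parameter groups to optimize.

    Hypothetical helper: frames at or above the (assumed) threshold are
    treated as reliable and update both pose and implicit networks, while
    the remaining frames update pose only.
    """
    plan = {}
    for frame, confidence in enumerate(confidences):
        if confidence >= threshold:
            plan[frame] = ("pose", "implicit_networks")
        else:
            plan[frame] = ("pose",)
    return plan


# Three frames; the middle one is treated as unreliable (e.g. heavy occlusion).
plan = plan_frame_updates([0.9, 0.2, 0.7])
```

Restricting unreliable frames to pose updates keeps potentially wrong observations from corrupting the shared shape and appearance model, which matches the motivation given in the caption.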