Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses

Inhee Lee, Byungjun Kim, Hanbyul Joo

TL;DR

This work tackles 4D reconstruction of dynamic scenes from monocular video with sparse observations by representing both the static background and multiple humans with 3D Gaussian Splatting (3D-GS) in a common framework. It canonicalizes each human in an SMPL-based space and uses Score Distillation Sampling (SDS) with a diffusion prior, together with Textual Inversion and ControlNet, to synthesize unseen views while preserving the observed identity, which yields animatable avatars and supports scene editing. Key contributions include (1) a unified 3D-GS representation of the background and multiple humans, (2) canonical-space fusion of sparse observations guided by diffusion priors, and (3) an efficient 4D reconstruction and editing pipeline evaluated quantitatively and qualitatively on challenging scenes. The reconstructed scene can be rendered from novel viewpoints, and the motion of each human can be edited independently.
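
To make the idea of a canonical human representation composed with a static background more concrete, here is a minimal sketch, not the authors' implementation. It assumes each human's Gaussian centers live in a canonical (rest-pose) space and are moved to the posed space by linear blend skinning, the deformation model SMPL uses for its mesh vertices. The function names `pose_gaussian_centers` and `compose_scene` are hypothetical. Only Gaussian centers are handled here; a full 3D-GS pipeline would also transform covariances and carry opacity and color.

```python
import numpy as np


def pose_gaussian_centers(canonical_xyz, skin_weights, joint_transforms):
    """Move canonical Gaussian centers to posed space by linear blend skinning.

    canonical_xyz:    (N, 3) centers in the canonical (rest-pose) space
    skin_weights:     (N, J) per-Gaussian weights over J joints, rows sum to 1
    joint_transforms: (J, 4, 4) canonical-to-posed transform of each joint
    """
    # Blend the per-joint transforms for every Gaussian: (N, 4, 4).
    blended = np.einsum("nj,jab->nab", skin_weights, joint_transforms)
    # Lift centers to homogeneous coordinates: (N, 4).
    ones = np.ones((len(canonical_xyz), 1))
    homo = np.concatenate([canonical_xyz, ones], axis=1)
    # Apply each blended transform to its own center.
    posed = np.einsum("nab,nb->na", blended, homo)
    return posed[:, :3]


def compose_scene(background_xyz, posed_humans):
    """Concatenate background centers and every posed human into one set."""
    return np.concatenate([background_xyz, *posed_humans], axis=0)
```

Because every human is an independent set of Gaussians in this sketch, removing a person amounts to leaving that set out of `compose_scene`, and re-animating a person amounts to passing different `joint_transforms`.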

Abstract

In this paper, we present a method to reconstruct the world and multiple dynamic humans in 3D from a monocular video input. As a key idea, we represent both the world and multiple humans via the recently emerging 3D Gaussian Splatting (3D-GS) representation, which lets us conveniently and efficiently compose and render them together. In particular, we address scenarios with severely limited and sparse observations in 3D human reconstruction, a common challenge encountered in the real world. To tackle this challenge, we introduce a novel approach that optimizes the 3D-GS representation in a canonical space by fusing the sparse cues in this common space, where we leverage a pre-trained 2D diffusion model to synthesize unseen views while keeping them consistent with the observed 2D appearances. We demonstrate that our method can reconstruct high-quality animatable 3D humans in various challenging examples, in the presence of occlusion, image crops, few-shot, and extremely sparse observations. After reconstruction, our method is capable of not only rendering the scene from any novel view at arbitrary time instances, but also editing the 3D scene by removing individual humans or applying different motions to each human. Through various experiments, we demonstrate the quality and efficiency of our method over alternative existing approaches.
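
For reference, the score distillation gradient as introduced in DreamFusion can be written as below. This is the generic form, not necessarily the exact loss used in this paper. The symbols are assumptions for illustration: $\theta$ stands for the parameters being optimized (here, plausibly the canonical 3D-GS parameters), $x = g(\theta)$ is a rendered image, $y$ is the conditioning (for example a text prompt), $\hat{\epsilon}_\phi$ is the frozen diffusion model's noise prediction, and $w(t)$ is a timestep weighting.

```latex
% Forward diffusion of the rendered image x = g(\theta)
x_t = \alpha_t \, x + \sigma_t \, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% Score Distillation Sampling gradient (DreamFusion form)
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
```

The gradient pushes renderings from unobserved viewpoints toward images the diffusion model considers likely under $y$, while the observed views remain constrained by the usual reconstruction losses.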

Paper Structure

This paper contains 30 sections, 14 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: We present a method to reconstruct dynamic scenes from a monocular video capturing partial 2D observations. As a key advantage, our method can estimate the unseen body parts by leveraging a pre-trained diffusion model [rombach2022stablediffusion_cvpr] via the SDS method [poole2022dreamfusion]. The reconstructed scenes can be rendered from any viewpoint, and each human body can be transformed into any body posture controlled by SMPL [smpl2015] parameters.
  • Figure 2: Method overview. Overview of our pipeline (see the method section).
  • Figure 3: Failure examples of optimizing 3D-GS naively. (a) shows that naively optimizing 3D-GS produces spiky, hedgehog-like artifacts in both the unseen view and the input view. (b) shows that our SDS loss effectively removes these artifacts in both input and unseen views.
  • Figure 4: Novel view synthesis results on the Panoptic dataset [panoptic_tpami].
  • Figure 5: Novel view synthesis output on the Hi4D pair00-dance sequence. While HumanNeRF [Weng2022humannerf] fails to reconstruct a face, ours synthesizes a plausible face guided by the diffusion model [rombach2022stablediffusion_cvpr]. (d) plots the camera positions relative to the front view of the female body. As shown, most of the rendered views were never observed in the training views.
  • ...and 4 more figures