MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video
Hengyi Wang, Jingwen Wang, Lourdes Agapito
TL;DR
MorpheuS reconstructs accurate geometry and vivid appearance for dynamic scenes from casual monocular RGB-D video. It decouples deformation from a hyper-dimensional canonical field and uses a diffusion prior to complete unobserved regions realistically. The method warps observed points into a canonical space and queries a hash-encoded SDF/color field to enable 360° rendering, while distilling knowledge from a view-conditioned diffusion model via Score Distillation Sampling. Its optimization combines real-view supervision with diffusion-based priors and with regularization in both canonical and parameter spaces; temporal conditioning and view-aware weighting stabilize training. Results on real and synthetic datasets show improved geometry accuracy, realistic completion of unobserved regions, and strong novel-view synthesis, suggesting practical value for robust, model-agnostic dynamic scene reconstruction. The work advances neural rendering for dynamic scenes by combining canonical-space regularization with diffusion priors to achieve high-fidelity 360° reconstructions from casual RGB-D input.
Abstract
Neural rendering has demonstrated remarkable success in dynamic scene reconstruction. Thanks to the expressiveness of neural representations, prior works can accurately capture the motion and achieve high-fidelity reconstruction of the target object. Despite this, real-world video scenarios often feature large unobserved regions where neural representations struggle to achieve realistic completion. To tackle this challenge, we introduce MorpheuS, a framework for dynamic 360° surface reconstruction from a casually captured RGB-D video. Our approach models the target scene as a canonical field that encodes its geometry and appearance, in conjunction with a deformation field that warps points from the current frame to the canonical space. We leverage a view-dependent diffusion prior and distill knowledge from it to achieve realistic completion of unobserved regions. Experimental results on various real-world and synthetic datasets show that our method can achieve high-fidelity 360° surface reconstruction of a deformable object from a monocular RGB-D video.
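The split between a deformation field and a canonical field can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than the authors' implementation: `deform_to_canonical` replaces the learned deformation network with a fixed time-dependent translation, and `canonical_sdf` replaces the learned canonical field with the analytic signed distance to a unit sphere. The sketch only shows the composition order: a point in the current frame is first warped to canonical space, then the canonical field is queried there.

```python
import math

def deform_to_canonical(point, t):
    """Toy deformation field (assumption): the object drifts along +x at
    0.5 units per time step, so warping to canonical space undoes that shift."""
    x, y, z = point
    return (x - 0.5 * t, y, z)

def canonical_sdf(point, radius=1.0):
    """Toy canonical field (assumption): signed distance to a sphere
    centred at the origin. Negative inside, zero on the surface."""
    return math.sqrt(sum(c * c for c in point)) - radius

def sdf_at_time(point, t):
    """Signed distance of a current-frame point at time t, obtained by
    warping to canonical space and querying the canonical field."""
    return canonical_sdf(deform_to_canonical(point, t))

if __name__ == "__main__":
    # At t = 0 the sphere sits at the origin; at t = 2 it has moved to x = 1.
    for p, t in [((1.0, 0.0, 0.0), 0.0), ((2.0, 0.0, 0.0), 2.0), ((1.0, 0.0, 0.0), 2.0)]:
        print(p, t, sdf_at_time(p, t))
```

Because the geometry lives only in the canonical field, every frame of the video constrains the same shape, and a prior applied to renderings of that shape can fill in regions that no frame observed.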
