4D3R: Motion-Aware Neural Reconstruction and Rendering of Dynamic Scenes from Monocular Videos

Mengqi Guo; Bo Xu; Yanyan Li; Gim Hee Lee

TL;DR

4D3R addresses novel view synthesis of dynamic scenes from monocular video without known camera poses. It combines 4D-aware information extraction, Motion-Aware Bundle Adjustment (MA-BA), and a Motion-Aware Gaussian Splatting (MA-GS) representation in a two-stage optimization, enabling pose-free rendering of scenes that contain dynamic objects. On real-world dynamic datasets, the authors report up to 1.8 dB PSNR improvement over state-of-the-art methods and a 5x reduction in computational requirements compared to previous dynamic scene representations. Intended applications include AR/VR and remote collaboration.

Abstract

Novel view synthesis from monocular videos of dynamic scenes with unknown camera poses remains a fundamental challenge in computer vision and graphics. While recent advances in 3D representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown promising results for static scenes, they struggle with dynamic content and typically rely on pre-computed camera poses. We present 4D3R, a pose-free dynamic neural rendering framework that decouples static and dynamic components through a two-stage approach. Our method first leverages 3D foundational models for initial pose and geometry estimation, followed by motion-aware refinement. 4D3R introduces two key technical innovations: (1) a motion-aware bundle adjustment (MA-BA) module that combines transformer-based learned priors with SAM2 for robust dynamic object segmentation, enabling more accurate camera pose refinement; and (2) an efficient Motion-Aware Gaussian Splatting (MA-GS) representation that uses control points with a deformation field MLP and linear blend skinning to model dynamic motion, significantly reducing computational cost while maintaining high-quality reconstruction. Extensive experiments on real-world dynamic datasets demonstrate that our approach achieves up to 1.8 dB PSNR improvement over state-of-the-art methods, particularly in challenging scenarios with large dynamic objects, while reducing computational requirements by 5x compared to previous dynamic scene representations.
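The control-point idea behind MA-GS can be illustrated with a minimal linear blend skinning sketch. This is not the authors' implementation: the Gaussian distance kernel, the `radius` parameter, and the helper names (`rbf_weights`, `blend_point`) are assumptions made for illustration, and the per-control-point rotations and translations are hard-coded here, whereas in the paper they would come from the deformation field MLP at a given time step.

```python
import math

def rotation_z(angle):
    """Return a 3x3 rotation about the z-axis as nested lists."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def rbf_weights(point, controls, radius=1.0):
    """Normalized Gaussian weights of a point w.r.t. each control point.

    The Gaussian kernel is an assumption; any non-negative weights that
    sum to one would work for linear blend skinning.
    """
    raw = []
    for ctrl in controls:
        sq_dist = sum((p - c) ** 2 for p, c in zip(point, ctrl))
        raw.append(math.exp(-sq_dist / (2.0 * radius ** 2)))
    total = sum(raw)
    return [r / total for r in raw]

def blend_point(point, controls, rotations, translations, radius=1.0):
    """Linear blend skinning: weighted sum of per-control rigid transforms.

    Each control point k rotates the point about itself and then
    translates it: R_k (p - c_k) + c_k + t_k.
    """
    weights = rbf_weights(point, controls, radius)
    out = [0.0, 0.0, 0.0]
    for w, c, rot, t in zip(weights, controls, rotations, translations):
        local = [p - ci for p, ci in zip(point, c)]
        moved = mat_vec(rot, local)
        for i in range(3):
            out[i] += w * (moved[i] + c[i] + t[i])
    return out

# Two control points that both translate by one unit along z.
controls = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
identity = rotation_z(0.0)
shifted = blend_point(
    [0.5, 0.0, 0.0],
    controls,
    [identity, identity],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
)
```

Because only the sparse control points carry motion parameters and every Gaussian center is moved by blending them, the number of motion variables scales with the number of control points rather than the number of Gaussians, which is consistent with the efficiency argument in the abstract.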
