Self-Calibrating 4D Novel View Synthesis from Monocular Videos Using Gaussian Splatting

Fang Li; Hao Zhang; Narendra Ahuja

TL;DR

This work proposes an approach that learns a high-fidelity 4D GS scene representation while self-calibrating the camera parameters. It extracts 2D point features that robustly represent 3D structure and uses them in the overall 4D scene optimization.

Abstract

Gaussian Splatting (GS) has significantly improved scene reconstruction efficiency and novel view synthesis (NVS) accuracy compared to Neural Radiance Fields (NeRF), particularly for dynamic scenes. However, current 4D NVS methods, whether based on GS or NeRF, rely primarily on camera parameters provided by COLMAP and often also use the sparse point clouds generated by COLMAP for initialization. These inputs lack accuracy and are time-consuming to compute. This sometimes results in poor dynamic scene representation, especially in scenes with large object movements or extreme camera conditions, e.g., small translations combined with large rotations. Some studies optimize the camera parameters and the scene simultaneously, supervised by additional information such as depth or optical flow obtained from off-the-shelf models. Using this unverified information as ground truth can reduce robustness and accuracy, a problem that arises frequently for long monocular videos (e.g., hundreds of frames or more). We propose a novel approach that learns a high-fidelity 4D GS scene representation with self-calibration of camera parameters. It includes the extraction of 2D point features that robustly represent 3D structure, and their use for subsequent joint optimization of camera parameters and 3D structure towards overall 4D scene optimization. We demonstrate the accuracy and time efficiency of our method through extensive quantitative and qualitative experimental results on several standard benchmarks. The results show significant improvements over state-of-the-art methods for 4D novel view synthesis. The source code will be released soon at https://github.com/fangli333/SC-4DGS.
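To make the idea of supervising camera parameters and 3D points with 2D point features concrete, the following is a minimal sketch of a reprojection error, the kind of quantity such a joint optimization could minimize. It is an illustration under stated assumptions, not the paper's formulation: the pinhole model, the single-axis rotation, and every name (`rotate_y`, `project`, `reprojection_error`, `focal`, `center`) are hypothetical and chosen only for brevity.

```python
import math

def rotate_y(point, angle):
    """Rotate a 3D point about the y axis by `angle` radians (simplified pose)."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def project(point, yaw, translation, focal, center):
    """Pinhole projection of a world point into a camera with the given pose."""
    x, y, z = rotate_y(point, yaw)
    x, y, z = x + translation[0], y + translation[1], z + translation[2]
    return (focal * x / z + center[0], focal * y / z + center[1])

def reprojection_error(points_3d, observations, yaw, translation, focal, center):
    """Mean pixel distance between projected 3D points and observed 2D points."""
    total = 0.0
    for p, (u_obs, v_obs) in zip(points_3d, observations):
        u, v = project(p, yaw, translation, focal, center)
        total += math.hypot(u - u_obs, v - v_obs)
    return total / len(points_3d)

# Hypothetical toy data: four 3D points seen by one camera.
points = [(0.0, 0.0, 4.0), (1.0, 0.5, 5.0), (-1.0, 1.0, 6.0), (0.5, -1.0, 3.0)]
true_yaw, true_t = 0.1, (0.2, -0.1, 0.5)
focal, center = 500.0, (320.0, 240.0)

# Synthetic 2D observations generated from the true pose.
obs = [project(p, true_yaw, true_t, focal, center) for p in points]

err_true = reprojection_error(points, obs, true_yaw, true_t, focal, center)
err_off = reprojection_error(points, obs, true_yaw + 0.05, true_t, focal, center)
print(err_true, err_off)
```

The error vanishes at the pose that generated the observations and grows when the pose is perturbed, which is what lets a gradient-based optimizer recover cameras and points from 2D tracks alone. A full system would use a general rotation parametrization and optimize the 3D points and intrinsics as well.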

Paper Structure (22 sections, 8 equations, 9 figures, 3 tables, 1 algorithm)

This paper contains 22 sections, 8 equations, 9 figures, 3 tables, 1 algorithm.

Introduction
Related Work
Method
Preliminaries
Structural Points Extraction (SPE)
Camera Parameters & 3D Structural Points Optimization
Dynamic Scene Representations Optimization
Experiments
Evaluation of Estimated Camera Parameters
Rendering Evaluation
Why we do not use the CoTracker output directly - Ablation Study
Conclusions
Appendix
Datasets
Implementation Details
...and 7 more sections

Figures (9)

Figure 1: Overview of SC-4DGS. We use the basin scene of the NeRF-DS dataset as the example. First, the SPE algorithm extracts 2D structural points $\mathbf{SP}^{2D}$ and 3D structural points $\mathbf{SP}^{3D}$ from $\mathbf{F}^{rgb}$ and $\mathbf{M}^{motion}$, and establishes the relationships among them. Then, the optimized cameras $\mathbf{Cam}^{O}$ are learned in the second, joint optimization module, starting from randomly initialized cameras $\mathbf{Cam}^{RI}$ and $\mathbf{SP}^{3D}$, supervised by the $\mathbf{SP}^{2D}$ estimated by SPE. Finally, given $\mathbf{T}$, the optimized $\mathbf{Cam}^{O}$, and $\mathbf{SP}^{3D}$, a Canonical Field and a Deformation Field (see text for details) are computed to optimize the mean representations and deformations of the scene, respectively, supervised by $\mathbf{F}^{rgb}$. In the middle of the figure, we show the learned camera positions (•) and orientations ($\rightarrow$), and the optimized $\mathbf{SP}^{3D}$ (•). $\rightarrow$ and $\leftarrow$ represent operation flow and gradient flow, respectively.
Figure 2: Visual Camera Comparisons on NeRF-DS. The red •, blue •, and black • bullets represent the camera poses estimated by our approach, COLMAP, and RoDynRF, respectively.
Figure 3: Optimized Point Cloud Comparisons. We use the plate scene in the NeRF-DS dataset as the example here and show more in the Appendix. The boxes and the corresponding viewpoints are color-coded. The dense points on the back wall plane formed using our estimated camera parameters, shown in the green and blue boxes, are more plausible than the scattered points from the same back wall formed using COLMAP camera parameters. The same holds for the window points in the red boxes.
Figure 4: Rendering & Camera Pose Comparisons on DAVIS. For each scene, we show the camera pose comparisons and rendering comparisons among Deformable-3DGS, RoDynRF, and ours, marking the relatively large pose or rendering differences with red boxes.
Figure 5: Visual Novel View Synthesis Results on NeRF-DS.
...and 4 more figures
