Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels

Yikai Wang; Xinzhou Wang; Zilong Chen; Zhengyi Wang; Fuchun Sun; Jun Zhu

TL;DR

Vidu4D is a novel reconstruction model that accurately reconstructs 4D representations from single generated videos, addressing challenges associated with non-rigidity and frame distortion.

Abstract

Video generative models are receiving particular attention given their ability to generate realistic and imaginative frames. These models are also observed to exhibit strong 3D consistency, which significantly enhances their potential to act as world simulators. In this work, we present Vidu4D, a novel reconstruction model that accurately reconstructs 4D (i.e., sequential 3D) representations from single generated videos, addressing challenges associated with non-rigidity and frame distortion. This capability is pivotal for creating high-fidelity virtual content that maintains both spatial and temporal coherence. At the core of Vidu4D is our proposed Dynamic Gaussian Surfels (DGS) technique. DGS optimizes time-varying warping functions to transform Gaussian surfels (surface elements) from a static state to a dynamically warped state. This transformation enables a precise depiction of motion and deformation over time. To preserve the structural integrity of surface-aligned Gaussian surfels, we design a warped-state geometric regularization based on continuous warping fields for estimating normals. Additionally, we learn refinements on the rotation and scaling parameters of Gaussian surfels, which greatly alleviate texture flickering during the warping process and enhance the capture of fine-grained appearance details. Vidu4D also contains a novel initialization state that provides a proper start for the warping fields in DGS. Equipping Vidu4D with an existing video generative model, the overall framework demonstrates high-fidelity text-to-4D generation in both appearance and geometry.
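To make the static-to-warped transformation concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. It assumes a surfel is described by a center and a 3x3 orientation matrix whose third column is the normal, and it replaces the learned time-varying warping functions with a hypothetical hand-written field, `toy_warp`, that twists points about the z-axis and translates them over time. The names `rotation_z`, `toy_warp`, `warp_surfel`, `twist_rate`, and `velocity` are illustrative assumptions. Rotating the surfel frame by the local warp rotation is also a simplification that ignores any shear of the field.

```python
import numpy as np


def rotation_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def toy_warp(center, t, twist_rate=0.5, velocity=(0.0, 0.0, 0.1)):
    """Hypothetical stand-in for a learned warping field.

    Returns a rotation and a translation that depend on both time `t`
    and the point's position, so the motion is non-rigid: points at
    different heights are twisted by different angles.
    """
    angle = twist_rate * t * center[2]
    return rotation_z(angle), t * np.asarray(velocity, dtype=float)


def warp_surfel(center, rotation, t):
    """Move one surfel from the static state to the warped state at time t."""
    R_w, T_w = toy_warp(center, t)
    warped_center = R_w @ center + T_w   # warp the surfel position
    warped_rotation = R_w @ rotation     # carry the local frame along
    normal = warped_rotation[:, 2]       # third axis is taken as the normal
    return warped_center, warped_rotation, normal


if __name__ == "__main__":
    static_center = np.array([1.0, 0.0, 2.0])
    static_rotation = np.eye(3)
    for t in (0.0, 0.5, 1.0):
        c, R, n = warp_surfel(static_center, static_rotation, t)
        print(f"t={t:.1f} center={np.round(c, 3)} normal={np.round(n, 3)}")
```

At `t = 0` the toy field is the identity, so the warped state coincides with the static state. In the paper's setting the warp would be optimized from the video rather than written by hand.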

Paper Structure (16 sections, 9 equations, 12 figures, 3 tables)

Introduction
Related works
Method
Problem Definition
Dynamic Gaussian Surfels
Vidu4D
Experiment
Implementation
Qualitative Evaluation
Quantitative Evaluation
Ablations
Conclusion
Appendix / supplemental material
Ablation: Field Initialization and Refinement
Additional Qualitative Comparison
...and 1 more sections

Figures (12)

Figure 1: Text-(to-video)-to-4D samples generated by equipping Vidu4D with a pretrained video diffusion model bao2024vidu. For each sample, we show per-frame 3D renderings of novel-view color, normal, and surfel feature. We observe that Vidu4D reconstructs precisely detailed and photo-realistic 4D representations. See the accompanying videos on our project page, https://vidu4d-dgs.github.io, for better visual quality.
Figure 2: Illustration of the overall framework and our DGS in detail. For DGS, Gaussian surfels in the static state are transformed to the warped state by learning non-rigid warping functions conditioned on time $t$ and coordinate $\mathbf{u}$. We incorporate warped-state normal regularization for accurate geometry, and refined rotation and scaling matrices of Gaussian surfels for detailed appearance. Both branches in the warped state, including with and without refinement, share the same centers of Gaussian surfels and the same warping functions. "Field init." stands for field initialization as introduced in Sec. \ref{subsec:vidu4d}.
Figure 3: Illustration of the pipeline of Vidu4D, including the initialization stage and the DGS stage.
Figure 4: Novel-view qualitative evaluation compared with SOTA methods, including NeRF-based methods (BANMo DBLP:conf/cvpr/YangVNRVJ22 and D-NeRF pumarola2021d) and Gaussian splatting-based methods (Deformable-GS yang2023deformable3dgs and SCGS huang2023sc). For a fair comparison, we also provide our learned camera poses to the baseline approaches; these variants are denoted as "w. Poses". Best viewed in color and zoomed in.
Figure 5: Ablation studies on the geometric regularization and refinement strategy. For our full model shown in (b), we provide our rendered color, rendered normal, and surface normal (estimated from the depth points for regularization). Additionally, for comparison, we visualize the rendered color for the case without refinements in (c) and the rendered normal for the case without warped-state normal regularization in (d), respectively. We showcase our model's fidelity with close-ups.
...and 7 more figures
