Disentangled 4D Gaussian Splatting: Rendering High-Resolution Dynamic World at 343 FPS
Hao Feng, Hao Sun, Wei Xie, Zhi Zuo, Zhengzhe Liu
TL;DR
Disentangled4DGS tackles real-time dynamic novel view synthesis by decoupling the temporal and spatial factors of 4D Gaussian representations and adopting a projection-first rendering pipeline. By modeling a 4D Gaussian as a 3D base Gaussian plus a temporal scale and a velocity of the mean, and by projecting to ray space before slicing in time, the method avoids expensive 4D-to-3D recomputation and achieves high frame rates with reduced storage. A flow-gradient guided consistency loss, together with a temporal splitting strategy, improves motion fidelity and reduces artifacts, especially at motion boundaries. Evaluations on the Plenoptic, Google Immersive, HyperNeRF, and D-NeRF datasets show that the method renders at up to 343 FPS at $1352\times1014$ on an RTX 3090 and reduces storage by at least 4.5%, while outperforming prior 4DGS methods in both quality and speed.
Abstract
While dynamic novel view synthesis from 2D videos has seen progress, efficient reconstruction and rendering of dynamic scenes remains challenging. In this paper, we introduce Disentangled 4D Gaussian Splatting (Disentangled4DGS), a novel representation and rendering pipeline that achieves real-time performance without compromising visual fidelity. Disentangled4DGS decouples the temporal and spatial components of 4D Gaussians, avoiding the slice-first step and the four-dimensional matrix calculations that prior methods require. By projecting temporal and spatial deformations into dynamic 2D Gaussians and deferring temporal processing, we minimize the redundant computations of 4DGS. Our approach also features a gradient-guided flow loss and a temporal splitting strategy to reduce artifacts. Experiments demonstrate a significant improvement in rendering speed and quality, achieving 343 FPS when rendering $1352\times1014$ images on a single RTX 3090 while reducing storage requirements by at least 4.5%. Our approach sets a new benchmark for dynamic novel view synthesis, outperforming existing methods on both multi-view and monocular dynamic scene datasets.
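To make the decoupling concrete, the following is a minimal NumPy sketch of one way a Gaussian with a 3D base, a temporal scale, and a mean velocity could be evaluated at a given time, and of how a projection-first ordering differs from slicing first. It is an illustration under stated assumptions, not the paper's implementation: the linear motion model, the 1D Gaussian temporal falloff, the pinhole camera with focal length `focal`, the first-order (Jacobian) projection of the velocity, and all function and parameter names are hypothetical.

```python
import numpy as np


def slice_first(mean_xyz, velocity, t_center, t_scale, opacity, t):
    """Slice-first view: move the 3D mean to time t, then it would be projected.

    Assumes linear motion of the mean and a 1D Gaussian falloff in time
    (both are illustrative assumptions, not taken from the paper).
    """
    dt = t - t_center
    mean_t = mean_xyz + velocity * dt                      # 3D mean at time t
    temporal_weight = np.exp(-0.5 * (dt / t_scale) ** 2)   # fades away from t_center
    return mean_t, opacity * temporal_weight


def project_point(p, focal):
    """Pinhole projection of a camera-space point (x, y, z) to 2D."""
    return focal * p[:2] / p[2]


def projection_jacobian(p, focal):
    """Jacobian of the pinhole projection evaluated at camera-space point p."""
    x, y, z = p
    return np.array([
        [focal / z, 0.0, -focal * x / z**2],
        [0.0, focal / z, -focal * y / z**2],
    ])


def project_first(mean_xyz, velocity, t_center, focal, t):
    """Projection-first view: project the mean and velocity once, then move in 2D.

    The 2D velocity is a first-order approximation obtained from the Jacobian,
    so the time-dependent step is a cheap 2D shift instead of a 4D-to-3D slice.
    """
    mean_2d = project_point(mean_xyz, focal)
    velocity_2d = projection_jacobian(mean_xyz, focal) @ velocity
    return mean_2d + velocity_2d * (t - t_center)


# Example: a Gaussian centered at t = 0.5 moving along +x at constant depth.
mean = np.array([0.0, 0.0, 2.0])
vel = np.array([1.0, 0.0, 0.0])
mean_t, alpha_t = slice_first(mean, vel, t_center=0.5, t_scale=0.1, opacity=0.8, t=0.6)
uv_t = project_first(mean, vel, t_center=0.5, focal=100.0, t=0.6)
```

In this example the depth does not change over time, so the projection-first 2D position matches the projection of the sliced 3D mean; when the velocity has a depth component, the Jacobian-based shift is only a first-order approximation.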
