4DSTR: Advancing Generative 4D Gaussians with Spatial-Temporal Rectification for High-Quality and Consistent 4D Generation
Mengmeng Liu, Jiuming Liu, Yunpeng Zhang, Jiangtao Li, Michael Ying Yang, Francesco Nex, Hao Cheng
TL;DR
4DSTR addresses spatial-temporal inconsistency in generated 4D content in two ways: it models the temporal correlation of deformable 4D Gaussian points with a Mamba-based encoding layer, and it applies an adaptive per-frame Gaussian densification and pruning strategy. It rectifies per-frame Gaussian scales and rotations through learned residuals and keeps frames aligned with a temporal memory, achieving state-of-the-art performance on video-to-4D benchmarks with significant reductions in $\text{FID-VID}$ and $\text{FVD}$ over prior methods. Combined with diffusion priors, the approach also supports text-to-4D generation, extending high-quality 4D synthesis to language-guided scenarios. Overall, 4DSTR yields higher reconstruction quality, stronger spatial-temporal consistency, and more robust adaptation to rapid motion, with practical relevance for simulation, VR, and avatar animation.
Abstract
Recent advances in 2D image and 3D shape generation have drawn significant attention to dynamic 4D content generation. However, previous 4D generation methods commonly struggle to maintain spatial-temporal consistency and adapt poorly to rapid temporal variations because they lack effective spatial-temporal modeling. To address these problems, we propose a novel 4D generation network, 4DSTR, which modulates generative 4D Gaussian Splatting with spatial-temporal rectification. Specifically, temporal correlation across the generated 4D sequence is used to rectify deformable scales and rotations and to enforce temporal consistency. Furthermore, an adaptive spatial densification and pruning strategy handles significant temporal variations by dynamically adding or deleting Gaussian points with awareness of their pre-frame movements. Extensive experiments demonstrate that 4DSTR achieves state-of-the-art performance in video-to-4D generation, excelling in reconstruction quality, spatial-temporal consistency, and adaptation to rapid temporal movements.
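
To make the two mechanisms named in the abstract more concrete, the following NumPy sketch illustrates (a) rectifying per-frame scales and rotations with additive residuals and (b) adding or deleting Gaussian points according to how far they moved between frames. It is only an illustration under stated assumptions, not the authors' implementation: the function names `rectify` and `densify_and_prune`, the thresholds, the midpoint cloning rule, and the opacity-based pruning criterion are all hypothetical, and in the actual method the residuals would be predicted by the temporal (Mamba-based) encoder rather than passed in by hand.

```python
import numpy as np


def rectify(scales, rotations, d_scales, d_rotations):
    """Apply additive residuals to per-frame Gaussian scales and rotations.

    scales:    (N, 3) per-axis scales of N Gaussians in one frame
    rotations: (N, 4) quaternions
    d_scales, d_rotations: residuals of matching shape (assumed to come
    from some temporal model; here they are plain inputs).
    """
    new_scales = scales + d_scales
    new_rot = rotations + d_rotations
    # Re-normalise so the rectified rotations remain unit quaternions.
    new_rot = new_rot / np.linalg.norm(new_rot, axis=-1, keepdims=True)
    return new_scales, new_rot


def densify_and_prune(prev_xyz, curr_xyz, opacity,
                      move_thresh=0.1, opacity_thresh=0.05):
    """Movement-aware densification and pruning (hypothetical rule).

    prev_xyz, curr_xyz: (N, 3) Gaussian centres in consecutive frames
    opacity:            (N,) per-Gaussian opacity
    """
    # Displacement of each point between the two frames.
    motion = np.linalg.norm(curr_xyz - prev_xyz, axis=-1)
    # Assumption: nearly transparent points are deleted.
    keep = opacity >= opacity_thresh
    # Assumption: surviving fast-moving points are cloned.
    split = keep & (motion > move_thresh)
    # Place each clone at the midpoint of the point's displacement.
    new_pts = 0.5 * (prev_xyz[split] + curr_xyz[split])
    out_xyz = np.concatenate([curr_xyz[keep], new_pts], axis=0)
    out_opacity = np.concatenate([opacity[keep], opacity[split]], axis=0)
    return out_xyz, out_opacity
```

With three points of which one moves quickly and one is nearly transparent, `densify_and_prune` drops the transparent point and adds one clone for the fast mover, so the point count stays at three while coverage shifts toward the moving region.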
