4DSTR: Advancing Generative 4D Gaussians with Spatial-Temporal Rectification for High-Quality and Consistent 4D Generation

Mengmeng Liu, Jiuming Liu, Yunpeng Zhang, Jiangtao Li, Michael Ying Yang, Francesco Nex, Hao Cheng

TL;DR

4DSTR addresses spatial-temporal inconsistency in 4D content by introducing temporal correlation of deformable 4D Gaussian points via a Mamba-based encoding layer and an adaptive per-frame Gaussian densification/pruning strategy. It rectifies per-frame Gaussian scales and rotations through learned residuals and maintains frame alignment with a temporal memory, achieving state-of-the-art performance on video-to-4D benchmarks with significant reductions in $\text{FID-VID}$ and $\text{FVD}$ over prior methods. The approach also supports text-to-4D generation when combined with diffusion priors, extending high-quality 4D synthesis to language-guided scenarios. Overall, 4DSTR yields higher reconstruction quality, stronger spatial-temporal consistency, and robust adaptation to rapid motion, with practical impact for simulation, VR, and avatar animation.

Abstract

Remarkable recent advances in 2D image and 3D shape generation have drawn significant attention to dynamic 4D content generation. However, previous 4D generation methods commonly struggle to maintain spatial-temporal consistency and adapt poorly to rapid temporal variations, due to the lack of effective spatial-temporal modeling. To address these problems, we propose a novel 4D generation network called 4DSTR, which modulates generative 4D Gaussian Splatting with spatial-temporal rectification. Specifically, temporal correlation across generated 4D sequences is used to rectify deformable scales and rotations and to enforce temporal consistency. Furthermore, an adaptive spatial densification and pruning strategy is proposed to address significant temporal variations by dynamically adding or deleting Gaussian points with awareness of their per-frame movements. Extensive experiments demonstrate that 4DSTR achieves state-of-the-art performance in video-to-4D generation, excelling in reconstruction quality, spatial-temporal consistency, and adaptation to rapid temporal movements.
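
As a rough illustration of the rectification idea described in the abstract, the sketch below applies residuals to per-frame Gaussian scales and rotations. It is a minimal NumPy sketch, not the authors' implementation: the additive form of the residual, the quaternion re-normalization, the array shapes, and the function name `rectify_gaussians` are all assumptions. The residuals here are plain arrays standing in for whatever the Mamba-based temporal encoding would predict.

```python
import numpy as np


def rectify_gaussians(scales, rotations, d_scales, d_rotations, eps=1e-8):
    """Apply residuals to per-frame Gaussian scales and rotations.

    Assumed (hypothetical) shapes:
      scales, d_scales:       (T, N, 3)  per-frame, per-Gaussian axis scales
      rotations, d_rotations: (T, N, 4)  per-frame, per-Gaussian quaternions
    """
    # Assumption: residuals are added directly to the deformed parameters.
    new_scales = scales + d_scales
    new_rotations = rotations + d_rotations

    # Re-normalize so each rotation stays a unit quaternion after the update.
    norm = np.linalg.norm(new_rotations, axis=-1, keepdims=True)
    new_rotations = new_rotations / np.clip(norm, eps, None)
    return new_scales, new_rotations


if __name__ == "__main__":
    T, N = 4, 5  # frames, Gaussians (toy sizes)
    rng = np.random.default_rng(0)
    scales = np.ones((T, N, 3))
    rotations = np.zeros((T, N, 4))
    rotations[..., 0] = 1.0  # identity quaternions (w, x, y, z)

    # Stand-in residuals; in the paper these come from a learned module.
    d_scales = 0.01 * rng.standard_normal((T, N, 3))
    d_rotations = 0.01 * rng.standard_normal((T, N, 4))

    s, r = rectify_gaussians(scales, rotations, d_scales, d_rotations)
    print(s.shape, r.shape)
```

With zero residuals the function returns the inputs unchanged (for already-normalized rotations), which makes it easy to check that the rectification step only perturbs the deformed parameters rather than replacing them.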

Paper Structure

This paper contains 19 sections, 9 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Consistent 4D generation with spatial-temporal rectification. Our method proposes a novel framework for high-quality 4D generation, as shown in (A). Compared to the state-of-the-art method STAG4D [zeng2024stag4d], our method achieves higher generation consistency and quality in the dynamic region (red circle) of the generated 4D sequences (B), which demonstrates that our rectification methods significantly boost spatial-temporal consistency in generative 4D Gaussian representations.
  • Figure 2: Rapid temporal variations among frames. The Minion's mouth changes appearance rapidly between two frames. Unlike STAG4D [zeng2024stag4d], our method uses an adaptive Gaussian densification and pruning strategy, which substantially improves the adaptation capability of our generative 4D Gaussians.
  • Figure 3: The overall pipeline of our 4DSTR. Given an input video, we use Zero123++ [shi2023zero123++] to generate multi-view frames and initialize the first-frame 3D Gaussians. A lightweight multi-head decoder then maps voxel features to per-frame 4D Gaussian parameters. To ensure 4D coherence, our temporal correlation module regresses scale and rotation residuals, while per-frame adaptive densification and pruning dynamically adjust Gaussian counts to capture rapid spatial changes.
  • Figure 4: Illustration of the per-frame adaptive Gaussian densification strategy. We accumulate and average each Gaussian point's gradient over training steps. Then, at each timestep $t$, we independently apply densification or pruning based on its averaged gradient. For example, when a Minion's mouth opens at $F_t$, we densify that region; when it closes at $F_T$, we prune it.
  • Figure 5: Qualitative comparisons on video-to-4D generation. Compared with the recent SOTA method STAG4D [zeng2024stag4d], our method delivers higher-quality results in dynamic regions such as the squirrel's and deer's heads or the elephant's trunk.
  • ...and 3 more figures
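
The per-frame densification and pruning decision described in the Figure 4 caption can be sketched as follows. This is a hedged illustration only: the caption states that gradients are accumulated and averaged per Gaussian and that each timestep is handled independently, but the specific thresholds, the use of a low averaged gradient as the pruning criterion, the array layout, and the function name `per_frame_densify_prune` are assumptions made for this example.

```python
import numpy as np


def per_frame_densify_prune(grad_sum, step_count, densify_thresh, prune_thresh):
    """Decide, per frame and per Gaussian, whether to densify or prune.

    Assumed (hypothetical) inputs:
      grad_sum:   (T, N) gradient magnitude accumulated over training steps
      step_count: scalar or (T, N) number of steps that contributed
      densify_thresh / prune_thresh: illustrative thresholds, not from the paper
    """
    # Average the accumulated gradient; guard against division by zero.
    avg_grad = grad_sum / np.maximum(step_count, 1)

    # Each frame t is thresholded independently of the other frames.
    densify_mask = avg_grad > densify_thresh  # region needs more Gaussians
    prune_mask = avg_grad < prune_thresh      # assumed criterion for removal
    return avg_grad, densify_mask, prune_mask


if __name__ == "__main__":
    # Two frames, three Gaussians, gradients accumulated over 10 steps.
    grad_sum = np.array([[9.0, 0.2, 3.0],
                         [0.1, 0.2, 3.0]])
    avg, densify, prune = per_frame_densify_prune(
        grad_sum, step_count=10, densify_thresh=0.5, prune_thresh=0.05
    )
    print(densify)
    print(prune)
```

In this toy input, the first Gaussian is marked for densification in frame 0 and for pruning in frame 1, mirroring the caption's example of a region that is densified when the mouth opens and pruned when it closes.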