STP4D: Spatio-Temporal-Prompt Consistent Modeling for Text-to-4D Gaussian Splatting
Yunze Deng, Haijun Xiong, Bin Feng, Xinggang Wang, Wenyu Liu
TL;DR
STP4D tackles spatio-temporal-prompt misalignment in text-to-4D generation with a unified framework that combines Time-varying Prompt Embedding, Geometric Information Enhancement, and Temporal Extension Deformation. By integrating DDIM-based diffusion with 4D Gaussian Splatting and a GroupFormer-enabled geometry module, it produces high-quality, temporally coherent 4D content with rapid inference (approximately 4.6 s per asset) on Diffusion4D. The work reports state-of-the-art quantitative gains in CLIP-based semantic alignment and FVD, supported by ablations that confirm the role of each module and constraint. The authors note limitations stemming from dataset scale and fixed Gaussian counts, and suggest that richer datasets and larger models could further improve performance in complex scenes.
Abstract
Text-to-4D generation is developing rapidly and is widely applied in various scenarios. However, existing methods often fail to incorporate adequate spatio-temporal modeling and prompt alignment within a unified framework, resulting in temporal inconsistencies, geometric distortions, or low-quality 4D content that deviates from the provided text. We therefore propose STP4D, a novel approach that integrates comprehensive spatio-temporal-prompt consistency modeling for high-quality text-to-4D generation. Specifically, STP4D employs three carefully designed modules that work together toward this goal: Time-varying Prompt Embedding, Geometric Information Enhancement, and Temporal Extension Deformation. Furthermore, STP4D is among the first methods to use a diffusion model to generate 4D Gaussians, combining the fine-grained modeling capability and real-time rendering of 4D Gaussian Splatting (4DGS) with the rapid inference speed of diffusion models. Extensive experiments demonstrate that STP4D generates high-fidelity 4D content with exceptional efficiency (approximately 4.6 s per asset), surpassing existing methods in both quality and speed.
