
STP4D: Spatio-Temporal-Prompt Consistent Modeling for Text-to-4D Gaussian Splatting

Yunze Deng, Haijun Xiong, Bin Feng, Xinggang Wang, Wenyu Liu

TL;DR

STP4D tackles the problem of spatio-temporal-prompt misalignment in text-to-4D generation by introducing a unified framework that combines Time-varying Prompt Embedding, Geometric Information Enhancement, and Temporal Extension Deformation. By integrating DDIM-based diffusion with 4D Gaussian Splatting and a GroupFormer-enabled geometry module, it achieves high-quality, temporally coherent 4D content with rapid inference (approximately 4.6 s per asset) on Diffusion4D. The work demonstrates state-of-the-art quantitative gains in CLIP-based semantic alignment and FVD, supported by ablations that confirm the essential roles of each module and constraint. The authors note limitations due to dataset scale and fixed Gaussian counts, suggesting that richer datasets and larger models could further enhance performance in complex scenes.

Abstract

Text-to-4D generation is rapidly developing and widely applied in various scenarios. However, existing methods often fail to incorporate adequate spatio-temporal modeling and prompt alignment within a unified framework, resulting in temporal inconsistencies, geometric distortions, or low-quality 4D content that deviates from the provided texts. Therefore, we propose STP4D, a novel approach that aims to integrate comprehensive spatio-temporal-prompt consistency modeling for high-quality text-to-4D generation. Specifically, STP4D employs three carefully designed modules: Time-varying Prompt Embedding, Geometric Information Enhancement, and Temporal Extension Deformation, which collaborate to accomplish this goal. Furthermore, STP4D is among the first methods to exploit the Diffusion model to generate 4D Gaussians, combining the fine-grained modeling capabilities and the real-time rendering process of 4DGS with the rapid inference speed of the Diffusion model. Extensive experiments demonstrate that STP4D excels in generating high-fidelity 4D content with exceptional efficiency (approximately 4.6s per asset), surpassing existing methods in both quality and speed.
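
The TL;DR attributes the fast inference to DDIM-based diffusion. As a rough illustration of why DDIM sampling is cheap, the sketch below implements one deterministic DDIM update (the eta = 0 case) on a flat list of values. It is not STP4D's implementation: the function name `ddim_step`, its parameters, and the flat-list stand-in for Gaussian attributes are all assumptions made for this example, and the noise prediction that a trained network would supply is passed in directly.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0), applied elementwise.

    x_t            : current noisy sample (flat list of floats; hypothetical
                     stand-in for whatever the model actually denoises)
    eps_pred       : predicted noise for x_t (would come from a network)
    alpha_bar_t    : cumulative noise-schedule product at the current step
    alpha_bar_prev : cumulative product at the step being jumped to
    """
    sqrt_a_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_a_t = math.sqrt(1.0 - alpha_bar_t)
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_a_prev = math.sqrt(1.0 - alpha_bar_prev)

    x_prev = []
    for x, e in zip(x_t, eps_pred):
        # Estimate the clean sample implied by the current noise prediction.
        x0_hat = (x - sqrt_one_minus_a_t * e) / sqrt_a_t
        # Re-noise that estimate to the (less noisy) target step.
        x_prev.append(sqrt_a_prev * x0_hat + sqrt_one_minus_a_prev * e)
    return x_prev


# Usage: build a noisy sample from a known clean sample and known noise,
# then take one step to a less noisy level using the exact noise.
x0 = [1.0, -2.0]
eps = [0.5, 0.25]
a_t = 0.5
x_t = [math.sqrt(a_t) * c + math.sqrt(1.0 - a_t) * n for c, n in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, alpha_bar_t=a_t, alpha_bar_prev=0.8)
```

Because the update is deterministic, the sampler can jump between widely spaced noise levels, so a small number of steps may suffice. This is the usual reason DDIM-style sampling is fast; the step count STP4D uses is not stated in this summary.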

Paper Structure

This paper contains 26 sections, 17 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Visual comparisons between STP4D and other methods highlight the superiority of spatio-temporal-prompt consistency. (a) The panda generated from 4DFY, marked by the red circle, exhibits unreasonable body structures; (b) The clownfish from Animate124 shows drastic appearance changes across frames during swimming; (c) The robot generated by MAV3D misinterprets the prompt, failing to correctly throw and flip the coin.
  • Figure 2: (a) Pipeline of STP4D; (b) Details of Geometric Information Enhancement (GIE); (c) Details of Temporal Extension Deformation (TED).
  • Figure 3: (a) Visual comparisons between STP4D and other competitive methods. (b) Various 4D assets generated from STP4D.
  • Figure 4: The detailed structure of GroupFormer. $D'$ represents the hidden dimension.
  • Figure 5: Various visualizations of 4D assets generated from STP4D.