Spatial-Temporal State Propagation Autoregressive Model for 4D Object Generation

Liying Yang, Jialun Liu, Jiakui Hu, Chenhao Guan, Haibin Huang, Fangqiu Yi, Chi Zhang, Yanyan Liang

TL;DR

4DSTAR tackles spatial-temporal inconsistency in 4D object generation with a feed-forward autoregressive framework that exploits long-term dependencies across timesteps. It combines a Dynamic Spatial-Temporal State Propagation AutoRegressive Model (STAR) with a 4D Vector Quantized Variational Autoencoder (4D VQ-VAE) that decodes tokens into temporally coherent dynamic 3D Gaussians, refined by a Spatial-Temporal Offset Predictor (STOP). Key contributions include the first autoregressive approach for 4D object generation, the Spatial-Temporal Container, which builds long-term dependencies by clustering historical features, and the STOP module, which enforces temporal alignment across frames. Experiments on Objaverse-based data show that 4DSTAR achieves strong spatial-temporal consistency and performance competitive with diffusion-based methods on reconstruction and video-to-4D generation, while enabling efficient feed-forward generation of dynamic 4D content. The framework supports text/video-to-4D and text/image-to-3D generation with improved temporal stability and fidelity.

Abstract

Generating high-quality 4D objects with spatial-temporal consistency remains a formidable challenge. Existing diffusion-based methods often struggle with spatial-temporal inconsistency because they fail to leverage outputs from all previous timesteps to guide generation at the current timestep. We therefore propose a Spatial-Temporal State Propagation AutoRegressive Model (4DSTAR), which generates 4D objects while maintaining spatial-temporal consistency. 4DSTAR formulates generation as the prediction of tokens that represent the 4D object. It consists of two key components. (1) The dynamic spatial-temporal state propagation autoregressive model (STAR) achieves spatially and temporally consistent generation. Unlike standard autoregressive models, STAR divides the tokens to be predicted into groups based on timesteps. It models long-term dependencies by propagating spatial-temporal states from previous groups and uses these dependencies to guide generation at the next timestep. To this end, we propose a spatial-temporal container that dynamically updates the effective spatial-temporal state features from all historical groups; the updated features then serve as conditional features that guide the prediction of the next token group. (2) The 4D VQ-VAE implicitly encodes the 4D structure into a discrete space and decodes the discrete tokens predicted by STAR into temporally coherent dynamic 3D Gaussians. Experiments demonstrate that 4DSTAR generates spatial-temporally consistent 4D objects and achieves performance competitive with diffusion models.
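The group-wise propagation described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the names `StateContainer`, `generate_groups`, and `toy_predictor` are hypothetical, the running-mean clustering is a simple stand-in for the container's update rule, and the toy predictor replaces the actual network.

```python
from typing import Callable, List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class StateContainer:
    """Hypothetical stand-in for the spatial-temporal container.

    Holds at most `num_slots` state vectors. Each incoming feature is merged
    into its nearest slot by a running mean (a simple online clustering).
    """

    def __init__(self, num_slots: int) -> None:
        self.num_slots = num_slots
        self.slots: List[Vector] = []
        self.counts: List[int] = []

    def update(self, group_features: List[Vector]) -> None:
        for feat in group_features:
            if len(self.slots) < self.num_slots:
                # Free slot available: store the feature as a new state.
                self.slots.append(list(feat))
                self.counts.append(1)
                continue
            nearest = min(
                range(len(self.slots)),
                key=lambda i: squared_distance(self.slots[i], feat),
            )
            self.counts[nearest] += 1
            n = self.counts[nearest]
            # Running mean keeps the slot at the centroid of its members.
            self.slots[nearest] = [
                s + (f - s) / n for s, f in zip(self.slots[nearest], feat)
            ]

    def state(self) -> List[Vector]:
        return [list(s) for s in self.slots]


def generate_groups(
    num_timesteps: int,
    predict_group: Callable[[int, List[Vector]], List[Vector]],
    container: StateContainer,
) -> List[List[Vector]]:
    """Predict one group per timestep, conditioned on the propagated state."""
    groups = []
    for t in range(num_timesteps):
        group = predict_group(t, container.state())  # condition on history
        groups.append(group)
        container.update(group)  # fold the new group into the state
    return groups


def toy_predictor(t: int, state: List[Vector]) -> List[Vector]:
    """Placeholder for the real model: offsets the mean of the state."""
    base = [sum(col) / len(state) for col in zip(*state)] if state else [0.0, 0.0]
    return [[base[0] + t, base[1]], [base[0], base[1] + t]]


container = StateContainer(num_slots=2)
groups = generate_groups(3, toy_predictor, container)
```

The point of the sketch is the control flow: the state passed to the predictor has a fixed size no matter how many timesteps have been generated, yet it summarizes every earlier group rather than only the most recent one.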

Paper Structure (18 sections, 4 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Diffusion-based methods, such as the previous work yao2025sv4d, fail to leverage outputs from all previous timesteps to guide generation at the current timestep, and so produce results whose appearance is inconsistent at some timesteps. Our 4DSTAR alleviates this issue by leveraging historical outputs to guide generation at the current timestep.
  • Figure 2: The overall pipeline of our 4DSTAR. 4DSTAR consists of two key components: (a) 4D VQ-VAE. Given a 4D object, we first render it as a spatial-temporal matrix. The matrix is then encoded by the Encoder and compressed into discrete tokens. Static GS Generation decodes these tokens into static Gaussians. Meanwhile, the Spatial-Temporal Offset Predictor (STOP) corrects the static Gaussians into a canonical 4D space at each timestep. Finally, the model outputs dynamic 3D Gaussians. (b) Dynamic Spatial-Temporal State Propagation Autoregressive Model (STAR). The text and video conditions, which are compressed by an image tokenizer, are concatenated before the start token as the context. Either condition can prompt the model to generate tokens. The SEP token signals the model to begin generating tokens. Then, camera pose and timestep conditions are integrated. When the model starts to predict the next group, the historical groups are integrated into the Spatial-Temporal Container (S-T Container). The S-T Container updates the effective spatial-temporal state features, which serve as conditional features to guide the prediction of the next token group. Finally, STAR predicts all tokens that represent a 4D object.
  • Figure 3: Qualitative comparison with VQ-VAE sun2024autoregressive and UniTok ma2025unitok on 4D reconstruction. VQ-VAE and UniTok are used to reconstruct 2D view images, while the results of our 4D VQ-VAE are rendered under the corresponding views. Our 4D VQ-VAE produces temporally coherent reconstructions, whereas VQ-VAE and UniTok do not.
  • Figure 4: Qualitative comparison with the state-of-the-art methods stag4d, ren2024l4gm, zhang2025gaussian, yao2025sv4d on video-to-4D generation. For each method, we render results under two novel views at two timesteps. Our method achieves high-quality generation with spatial-temporal consistency.
  • Figure 5: Ablation of 4D VQ-VAE. Our model, which uses STOP, accurately recovers texture details at different timesteps.
  • ...and 3 more figures
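
The step in the Figure 2 caption where encoded features are "compressed into discrete tokens" corresponds, in a generic VQ-VAE, to a nearest-codebook lookup. The sketch below shows that generic lookup only. It assumes a plain Euclidean nearest-neighbour rule and a toy two-entry codebook, the function names `quantize` and `dequantize` are hypothetical, and nothing here is taken from the 4D VQ-VAE itself.

```python
from typing import List

Vector = List[float]


def quantize(features: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for feat in features:
        distances = [
            sum((f - c) ** 2 for f, c in zip(feat, code)) for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: List[int], codebook: List[Vector]) -> List[Vector]:
    """Look up the codebook vector for each discrete token."""
    return [list(codebook[t]) for t in tokens]


# Toy codebook with two entries (an assumption for illustration).
codebook = [[0.0, 0.0], [1.0, 1.0]]
tokens = quantize([[0.1, -0.1], [0.9, 1.2]], codebook)
recovered = dequantize(tokens, codebook)
```

In this toy case the two features map to tokens 0 and 1. An autoregressive model operates on such integer tokens, and the decoder consumes the looked-up vectors.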