Turbo4DGen: Ultra-Fast Acceleration for 4D Generation

Yuanbin Man, Ying Huang, Zhile Ren, Miao Yin

Abstract

4D generation, or dynamic 3D content generation, integrates spatial, temporal, and view dimensions to model realistic dynamic scenes, playing a foundational role in advancing world models and physical AI. However, maintaining long-chain consistency across both frames and viewpoints through the unique spatio-camera-motion (SCM) attention mechanism introduces substantial computational and memory overhead, often leading to out-of-memory (OOM) failures and prohibitive generation times. To address these challenges, we propose Turbo4DGen, an ultra-fast acceleration framework for diffusion-based multi-view 4D content generation. Turbo4DGen introduces a spatiotemporal cache mechanism that persistently reuses intermediate attention across denoising steps, combined with dynamic, semantic-aware attention pruning and an adaptive SCM chain bypass scheduler, to drastically reduce redundant SCM attention computation. Our experimental results show that Turbo4DGen achieves an average 9.7$\times$ speedup without quality degradation on the ObjaverseDy and Consistent4D datasets. To the best of our knowledge, Turbo4DGen is the first dedicated acceleration framework for 4D generation.
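The core idea of the spatiotemporal cache, reusing an attention output across adjacent denoising steps whenever it is nearly redundant, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `AttentionCache` class, the cosine-similarity threshold of 0.95 (motivated by the redundancy observation in Figure 4), and the `query_summary` proxy are all assumptions made for exposition.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened tensors."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class AttentionCache:
    """Hypothetical cache: skip an expensive attention call when the
    current step is sufficiently similar to the cached previous step."""

    def __init__(self, threshold=0.95):  # threshold is an illustrative choice
        self.threshold = threshold
        self.cached_output = None
        self.cached_summary = None

    def step(self, compute_attention, query_summary):
        # `compute_attention` stands in for the costly SCM attention block;
        # `query_summary` is a cheap per-step descriptor used to test redundancy.
        if (self.cached_output is not None and
                cosine_similarity(self.cached_summary, query_summary) >= self.threshold):
            return self.cached_output  # reuse: skip the expensive computation
        self.cached_output = compute_attention()
        self.cached_summary = query_summary
        return self.cached_output
```

With a 95% similarity gate, consecutive denoising steps whose inputs barely change hit the cache, so the expensive attention is evaluated only when the step actually diverges.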

Paper Structure

This paper contains 15 sections, 10 equations, 13 figures, 5 tables, 1 algorithm.

Figures (13)

  • Figure 1: Generated examples of Turbo4DGen, in comparison with the baseline, SV4D [xie2025svd]. Our Turbo4DGen completes the above 4D generation examples in only 9.78s and 12.15s, respectively, whereas SV4D requires around two minutes (110.85s and 118.76s), yielding 11.33$\times$ and 9.77$\times$ speedups without sacrificing content quality.
  • Figure 2: Latency analysis of multiple components in 4D generation [xie2025svd]. The results show that SCM attention is the main bottleneck, accounting for most of the computational overhead.
  • Figure 3: Performance analysis of removing spatial, camera, or motion attention blocks. The spatial attention block is observed to play a more critical role in the SCM attention chain.
  • Figure 4: The outputs of the SCM attention blocks in the final layer (downsampled to 16$\times$16 for visualization) across adjacent denoising steps (sampling every two steps) exhibit a cosine similarity exceeding 95%, indicating strong redundancy between consecutive steps.
  • Figure 5: Visualization of the spatial cross-attention map indicating semantic representations. (a) Two frames from a reference video; (b) The corresponding spatial cross-attention map; (c) Top-$K$ relevant tokens (dotted in red) representing semantic features.
  • ...and 8 more figures