FastSTAR: Spatiotemporal Token Pruning for Efficient Autoregressive Video Synthesis

Sungwoong Yune, Suheon Jeong, Joo-Young Kim

TL;DR

FastSTAR is a training-free acceleration framework for high-quality video generation. It refines only non-converged regions, preserving fluid motion while skipping computation where further refinement would be redundant.

Abstract

Visual Autoregressive modeling (VAR) has emerged as a highly efficient alternative to diffusion-based frameworks, achieving comparable synthesis quality. However, as this paradigm extends to Spacetime Autoregressive modeling (STAR) for video generation, scaling resolution and frame counts leads to a "token explosion" that creates a massive computational bottleneck in the final refinement stages. To address this, we propose FastSTAR, a training-free acceleration framework designed for high-quality video generation. Our core method, Spatiotemporal Token Pruning, identifies essential tokens by integrating two specialized terms: (1) Spatial similarity, which evaluates structural convergence across hierarchical scales to skip computations in regions where further refinement becomes redundant, and (2) Temporal similarity, which identifies active motion trajectories by assessing feature-level variations relative to the preceding clip. Combined with a Partial Update mechanism, FastSTAR ensures that only non-converged regions are refined, maintaining fluid motion while bypassing redundant computations. Experimental results on InfinityStar demonstrate that FastSTAR achieves up to a 2.01x speedup with a PSNR of 28.29 and less than 1% performance degradation, proving a superior efficiency-quality trade-off for STAR-based video synthesis.
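
The abstract describes the pruning rule only at a high level, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes cosine similarity as the similarity measure, a weighted sum (`alpha`) to combine the spatial and temporal terms, a fixed `keep_ratio`, and feature lists that are already aligned token by token (the previous scale upsampled to the current resolution, and the preceding clip matched by position). The names `select_tokens`, `partial_update`, and `refine` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def select_tokens(curr, prev_scale, prev_clip, keep_ratio=0.5, alpha=0.5):
    """Return the sorted indices of the tokens that still need refinement.

    curr, prev_scale, prev_clip: lists of per-token feature vectors,
    assumed to be aligned index by index (an assumption of this sketch).
    """
    scored = []
    for i, tok in enumerate(curr):
        spatial = cosine(tok, prev_scale[i])   # high => structurally converged
        temporal = cosine(tok, prev_clip[i])   # high => little motion
        # Assumed combination: low similarity on either term raises importance.
        importance = alpha * (1.0 - spatial) + (1.0 - alpha) * (1.0 - temporal)
        scored.append((importance, i))
    k = max(1, round(keep_ratio * len(curr)))
    top = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    return sorted(i for _, i in top)


def partial_update(curr, keep, refine):
    """Refine only the kept tokens; pruned tokens carry over unchanged.

    refine: stand-in for the transformer blocks; maps a list of tokens
    to a list of refined tokens of the same length.
    """
    out = [list(tok) for tok in curr]
    refined = refine([curr[i] for i in keep])
    for i, tok in zip(keep, refined):
        out[i] = list(tok)
    return out
```

In this sketch, a token that matches both its previous-scale estimate and the preceding clip receives an importance near zero and is skipped, while a token that differs on either term is passed to `refine`. How the paper actually weights, thresholds, or schedules the two terms is not stated in the abstract.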

Paper Structure (20 sections, 7 equations, 18 figures, 5 tables)

Figures (18)

  • Figure 1: FastSTAR achieves a 2.01$\times$ end-to-end speedup on a single H100 GPU for T2V and I2V tasks, maintaining high fidelity with PSNR of 28.29 and 25.65, respectively.
  • Figure 2: (a) Latency breakdown for generating a 720p video (5s, 81 frames), comparing InfinityStar with FastSTAR. (b) Latency vs. PSNR trade-off curves for T2V synthesis. FastSTAR establishes the Pareto frontier, consistently surpassing existing baselines.
  • Figure 3: (a) Overview of the FastSTAR framework. (b) Overall mechanism of FastSTAR: Spatiotemporal Token Pruning identifies converged regions and skips their transformer blocks entirely, while Partial Update preserves structural integrity.
  • Figure 4: (a) Scale-wise Fourier-based spectral analysis demonstrates early low-frequency convergence alongside sustained high-frequency updates. (b) High-frequency energy maps show intense localization. Comparing high-intensity locations with large incremental updates reveals significant opportunities for token pruning in converged regions.
  • Figure 5: Comparison of token pruning and merging. (a) Token merging at 40% obliterates high-frequency textures, while pruning preserves fine details even at high compression. (b) Merging exhibits higher average MSE and expanding variance across resolution scales, whereas pruning maintains tighter error bounds.
  • ...and 13 more figures
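
The Figure 4 caption refers to a Fourier-based spectral analysis of scale-wise updates. As a rough illustration of that kind of measurement, the sketch below computes the share of a signal's non-DC spectral energy that lies at or above a cutoff frequency. It is a 1D toy using a naive DFT; the function names, the `cutoff` parameter, and the choice to exclude the DC term are assumptions of this sketch, and the paper's analysis presumably operates on 2D or spatiotemporal feature maps.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def high_freq_share(x, cutoff):
    """Fraction of non-DC spectral energy at frequencies >= cutoff.

    Frequencies are measured in cycles per signal length; index k and
    index n - k represent the same frequency, so they are folded together.
    """
    n = len(x)
    energy = [abs(c) ** 2 for c in dft(x)]
    total = sum(energy[1:])  # exclude the DC term (k = 0)
    if total <= 1e-12:       # effectively constant signal: no detail to measure
        return 0.0
    high = sum(e for k, e in enumerate(energy) if k > 0 and min(k, n - k) >= cutoff)
    return high / total
```

Applied to the difference between two consecutive scales, a share close to 1 would indicate that the update carries mostly fine detail, while a share close to 0 would indicate that it still changes coarse structure.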