TIMERIPPLE: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space

Wenxuan Miao, Yulin Sun, Aiyue Chen, Jing Lin, Yiwu Yao, Yiming Gan, Jieru Zhao, Jingwen Leng, Mingyi Guo, Yu Feng

TL;DR

The work addresses the self-attention bottleneck in video diffusion transformers (vDiTs) by uncovering and exploiting spatio-temporal correlations at the token-channel level within the latent space. It introduces TimeRipple, an adaptive token-reuse framework that leverages RoPE-based channel groups to reuse partial attention scores along the temporal and spatial axes, guided by step-sensitive thresholds. Empirical results across four mainstream vDiTs show TimeRipple achieving up to 2.7× end-to-end speedups with minimal perceptual degradation (VBench loss < 0.06%) and up to 85% savings in self-attention computation, with additional gains when combined with existing sparsity techniques. The method offers a practical, model-agnostic acceleration path for commercial deployment of high-quality video generation, although the unstructured sparsity it induces complicates integration with current attention accelerators.

Abstract

The recent surge in video generation has shown the growing demand for high-quality video synthesis using large vision models. Existing video generation models are predominantly based on the video diffusion transformer (vDiT); however, they suffer from substantial inference delay due to self-attention. While prior studies have focused on reducing redundant computations in self-attention, they often overlook the inherent spatio-temporal correlations in video streams and directly borrow sparsity patterns from large language models to reduce attention computations. In this work, we take a principled approach to accelerating self-attention in vDiTs by leveraging the spatio-temporal correlations in the latent space. We show that the attention patterns within vDiTs are primarily due to the dominant spatial and temporal correlations at the token-channel level. Based on this insight, we propose a lightweight and adaptive reuse strategy that approximates attention computations by reusing partial attention scores of spatially or temporally correlated tokens along individual channels. We demonstrate that our method achieves significantly higher computational savings (85%) than state-of-the-art techniques across four vDiTs, while preserving almost identical video quality (<0.06% loss on VBench).
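To make the idea of reusing partial attention scores along individual channels more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It relies on one fact: the attention logits `q @ k.T` are a sum of per-channel-group partial products. Everything else is an assumption made for illustration: the group names `t`, `x`, `y` and their channel indices, the helper names `partial_scores` and `reuse_partial`, the `neighbor` array, the relative-distance test, and the fixed `threshold` (the paper describes step-sensitive thresholds, which are not modeled here). RoPE itself is also not applied.

```python
import numpy as np

def partial_scores(q, k, groups):
    """Per-channel-group partial logits; summing them gives q @ k.T."""
    return {name: q[:, idx] @ k[:, idx].T for name, idx in groups.items()}

def reuse_partial(q, k, idx, neighbor, threshold):
    """Partial logits for one channel group with row reuse (hypothetical helper).

    neighbor[i] is the index of an earlier, correlated token (neighbor[i] < i),
    or -1 if there is none. If the query slices of token i and its neighbor are
    close in relative L2 distance, the neighbor's row is copied instead of
    recomputed.
    """
    qg, kg = q[:, idx], k[:, idx]
    n = q.shape[0]
    scores = np.empty((n, k.shape[0]))
    reused = np.zeros(n, dtype=bool)
    for i in range(n):
        j = neighbor[i]
        if j >= 0:
            rel = np.linalg.norm(qg[i] - qg[j]) / (np.linalg.norm(qg[j]) + 1e-8)
            if rel < threshold:
                scores[i] = scores[j]   # reuse the neighbor's partial scores
                reused[i] = True
                continue
        scores[i] = qg[i] @ kg.T        # fall back to exact computation
    return scores, reused

# Toy setup (assumed): 2 frames x 3 tokens per frame, 6 channels in 3 groups.
rng = np.random.default_rng(0)
frames, per_frame, d = 2, 3, 6
n = frames * per_frame
groups = {"t": [0, 1], "x": [2, 3], "y": [4, 5]}
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))

# Make frame 1 identical to frame 0 in the "x" channels, mimicking tokens
# that are perfectly correlated across frames in that channel group.
q[per_frame:, 2:4] = q[:per_frame, 2:4]

# Each token in frame 1 points at the token in the same position in frame 0.
neighbor = np.array([-1] * per_frame + list(range(per_frame)))

parts = partial_scores(q, k, groups)
full = sum(parts.values())
approx_x, reused = reuse_partial(q, k, groups["x"], neighbor, threshold=0.1)
```

In this toy case the copied rows match the exact partial scores because the query slices are identical. With real latents the match is only approximate, which is why a threshold decides when reuse is acceptable.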

Paper Structure

This paper contains 32 sections, 4 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: An illustration of spatial and temporal patterns in one head of multi-head attention maps. Due to space limits, only 4 frames are shown here. The attention patterns are determined by the spatial and temporal correlations of the key and query. Spatial-dominated attention (on the left) primarily focuses on the spatial correlations within a frame; thus, the values between two frames are similar and can be reused. Temporal-dominated attention (on the right) primarily focuses on the temporal correlations across frames; thus, the values within a frame are similar and can be reused.
  • Figure 2: Examples of attention maps with different patterns. For visualization, we present only a fraction ($\frac{1}{2}$) of the full attention map along each dimension and zoom in on the attention scores of $2\times2$ frames on each attention map. On the left, spatially-varying attention maps primarily capture spatial information within individual frames. These spatially-dominated attention maps often show no significant variation across frames; thus, the values across frames are similar. On the right, as temporal-oriented channels dominate, temporally-varying attention maps increasingly focus on temporal correlations across frames; thus, the values within a frame are similar. The color bar shows the magnitude of attention scores and does not correspond to the values of $Q$ or $K$.
  • Figure 3: The overview of vDiT architectures. A vDiT consists of multiple blocks. Generally, each block contains a self-attention layer, a cross-attention layer, and a linear layer.
  • Figure 4: The execution breakdown of four popular vDiT models [lin2024open, hunyuan, hong2022cogvideo, wan2025] on a single NVIDIA H100 (80 GB). The computation of self-attention dominates the execution.
  • Figure 5: How different channel groups govern the final generation quality. Here, we deliberately reuse all tokens in a particular channel group. The upper part shows, with red arrows, how we reuse different channel groups. We show only three channels, representing the time, x-axis, and y-axis channel groups. The lower part shows the resulting frames; the overall videos exhibit similar effects (see supplementary). Note that our reuse strategy does not introduce the artifacts shown here.
  • ...and 9 more figures