FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge

Xuan Shen, Weize Ma, Yufa Zhou, Enhao Tang, Yanyue Xie, Zhengang Li, Yifan Gong, Quanyi Wang, Henghui Ding, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jun Lin, Jiuxiang Gu

TL;DR

FastCar tackles the latency of autoregressive video generation on edge devices by exploiting temporal redundancy with a Temporal Attention Score (TAS) to selectively replay cached MLP outputs. The approach reveals that MLPs dominate decode time and that higher TAS corresponds to smaller output drift, enabling efficient replay with controlled quality loss. An FPGA-based accelerator with Dynamic Resource Scheduling (DRS) supports the replay mechanism, delivering over 2× decoding speedups and improved energy efficiency while complementing sparse attention to reduce drifting in long-duration video generation. The work provides a theoretical foundation for TAS-guided replay and demonstrates practical viability for high-resolution, long-duration AR video generation on edge hardware.

Abstract

Auto-regressive (AR) models, initially successful in language generation, have recently shown promise in visual generation tasks due to their superior sampling efficiency. Unlike image generation, video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. Our key observations are: (i) MLP modules in the decode phase dominate the inference latency, and (ii) there exists high temporal redundancy in MLP outputs of adjacent frames. In this paper, we propose the FastCar framework to accelerate the decode phase of AR video generation by exploiting this temporal redundancy. The Temporal Attention Score (TAS) is proposed to determine whether to apply the replay strategy (i.e., reusing cached MLP outputs from the previous frame to reduce redundant computations), with detailed theoretical analysis and justification. We also develop a hardware accelerator on FPGA with Dynamic Resource Scheduling (DRS) based on TAS to enable better resource utilization and faster inference. Experimental results demonstrate the effectiveness of our method, which outperforms traditional sparse attention approaches with more than 2.1× decoding speedup and higher energy efficiency on the edge. Furthermore, combining FastCar with sparse attention boosts the performance of sparse attention and alleviates drifting, demonstrating our unique advantages for high-resolution and long-duration video generation. Code: https://github.com/shawnricecake/fast-car
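The replay strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `ReplayMLP`, the per-position cache keyed by spatial index, and the way TAS values are supplied by the caller are all assumptions made for illustration. The only rule taken from the paper summary is that replay is triggered when the average TAS exceeds a threshold $\tau$ (Figure 1 caption); how TAS itself is computed is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

Vector = List[float]


@dataclass
class ReplayMLP:
    """Hypothetical wrapper: replays a cached MLP output when the average TAS is high."""

    mlp: Callable[[Vector], Vector]  # the real (expensive) MLP forward pass
    tau: float                       # replay threshold (illustrative value, not from the paper)
    cache: Dict[int, Vector] = field(default_factory=dict)  # spatial index -> last computed output
    replayed: int = 0                # MLP calls skipped via replay
    computed: int = 0                # MLP calls actually executed

    def forward(self, position: int, hidden: Vector, tas: Sequence[float]) -> Vector:
        # Assumption: the caller passes the TAS values that should be averaged.
        avg_tas = sum(tas) / len(tas)
        cached = self.cache.get(position)
        if cached is not None and avg_tas > self.tau:
            # High score: reuse the output cached for the same spatial position.
            self.replayed += 1
            return cached
        # Low score (or empty cache): run the MLP and refresh the cache.
        out = self.mlp(hidden)
        self.cache[position] = out
        self.computed += 1
        return out

    @property
    def replay_ratio(self) -> float:
        """Fraction of forward calls that were served from the cache."""
        total = self.replayed + self.computed
        return self.replayed / total if total else 0.0
```

In this sketch, a lower `tau` makes replay more frequent and raises `replay_ratio`, trading recomputation for reuse; the actual threshold values and their effect on quality are reported in the paper's ablations.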

Paper Structure

This paper contains 27 sections, 3 theorems, 28 equations, 7 figures, 7 tables.

Key Result

Theorem 4.4

Let $X \in \mathbb{R}^{n \times d}$ be the hidden states, where each row $x_j \in \mathbb{R}^d$ represents the hidden state of token $j$. Let $\mathsf{Attn}(X)$ denote the attention output defined in Definition 4.1. For tokens $j = (t,i)$ and $j^- = (t{-}1,i)$ aligned at the same spatial … Let $\gamma := \|W_Q - W_K\|_2$ denote the projection difference. Then, under the Lipschitz continu…

Figures (7)

  • Figure 1: Left: FastCar framework. We replay the cache from the previous frame to skip the computations for MLP in decoding. Replay is triggered when the average TAS exceeds a predefined threshold $\tau$. Right Top: Latency cost of both prefill and decode phases for different sequence lengths. Right Bottom: Detailed latency cost of the decode phase for different sequence lengths.
  • Figure 2: Cosine similarity for MLP outputs between neighboring frames for all 32 MLP modules.
  • Figure 3: Left: The top-level block diagram of our hardware accelerator. Right: The DRS diagram.
  • Figure 4: Left: Ablation study comparing consistent vs. inconsistent threshold settings with respect to LPIPS and the VBench total score. Full results are provided in Table \ref{tab:supp_full_results_threshold_distribution} in Appendix \ref{sec:app_additional_results}. Right: Ablation study on the effect of the threshold $\tau$ on replay ratio and VBench total score. Full results are reported in Table \ref{tab:supp_full_results_threshold_value} in Appendix \ref{sec:app_additional_results}.
  • Figure 5: Replay ratio distribution across layers for thresholds $\tau = -1$, $-2$, and $-4$, respectively.
  • ...and 2 more figures
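
The temporal-redundancy measurement behind Figure 2 (cosine similarity of MLP outputs between neighboring frames) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' measurement code: the function names and the nested-list layout `frames[t][i]` (MLP output vector of spatial token `i` in frame `t`) are hypothetical, and the per-frame averaging over tokens is one plausible way to summarize the similarity.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_frame_similarity(frames: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Mean cosine similarity between spatially aligned tokens of frames t-1 and t.

    Assumed layout: frames[t][i] is the MLP output for spatial token i in frame t.
    Returns one value per adjacent frame pair, i.e. len(frames) - 1 values.
    """
    similarities = []
    for previous, current in zip(frames, frames[1:]):
        per_token = [cosine_similarity(p, c) for p, c in zip(previous, current)]
        similarities.append(sum(per_token) / len(per_token))
    return similarities
```

Values close to 1 for a given MLP module would indicate that its outputs change little from one frame to the next, which is the redundancy the replay strategy relies on.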

Theorems & Definitions (10)

  • Definition 4.1: Attention Module
  • Definition 4.2: MLP Module
  • Definition 4.3: Temporal Attention Score
  • Theorem 4.4: Attention Score Controls Attention Output Difference
  • Remark 4.5
  • Theorem 4.6: Attention and Input Similarity Implies MLP Output Similarity
  • Theorem 4.7: Temporal Attention Score Controls MLP Output Similarity
  • Proof of Theorem 4.4
  • Proof of Theorem 4.6
  • Proof of Theorem 4.7