Flow caching for autoregressive video generation

Yuexiao Ma, Xuzhe Zheng, Jing Xu, Xiwei Xu, Feng Ling, Xiawu Zheng, Huafeng Kuang, Huixia Li, Xing Wang, Xuefeng Xiao, Fei Chao, Rongrong Ji

TL;DR

FlowCache tackles the high cost of autoregressive video generation by introducing per-chunk caching and KV cache compression for diffusion-based video models. It builds on a theoretical property: for each chunk, the relative L1 distance $L1_{rel}$ between adjacent denoising steps changes monotonically along the denoising trajectory (Figure 2 reports an increase as denoising progresses), so chunks at different denoising stages differ in how safely they can reuse cached outputs. This motivates independent per-chunk caching decisions. Empirically, FlowCache yields 2.38× and 6.7× speedups on MAGI-1 and SkyReels-V2, respectively, with minimal VBench changes, establishing a new state of the art for real-time, ultra-long video synthesis. The approach is training-free and plug-and-play, offering substantial memory and computation savings while preserving generation quality.
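As a rough illustration of the quantity this summary refers to, the sketch below computes a relative L1 distance between two feature vectors taken at adjacent denoising steps. The function name `relative_l1`, the choice to normalize by the earlier step, and the `eps` guard are assumptions made for illustration; the paper's exact definition may differ.

```python
def relative_l1(prev, curr, eps=1e-8):
    """Relative L1 distance between two equally sized feature vectors.

    Assumption: the change is normalized by the L1 norm of the earlier
    step (`prev`); `eps` only guards against division by zero.
    """
    if len(prev) != len(curr):
        raise ValueError("feature vectors must have the same length")
    change = sum(abs(c - p) for p, c in zip(prev, curr))
    scale = sum(abs(p) for p in prev) + eps
    return change / scale


# Toy features for one chunk at two adjacent denoising steps.
step_a = [1.0, 2.0]
step_b = [1.5, 1.0]
distance = relative_l1(step_a, step_b)  # (0.5 + 1.0) / 3.0, roughly 0.5
```

A small value suggests that the two steps are similar enough for a cached output to be reused, while a large value suggests recomputation.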

Abstract

Autoregressive models, often built on Transformer architectures, represent a powerful paradigm for generating ultra-long videos by synthesizing content in sequential chunks. However, this sequential generation process is notoriously slow. While caching strategies have proven effective for accelerating traditional video diffusion models, existing methods assume uniform denoising across all frames, an assumption that breaks down in autoregressive models, where different video chunks exhibit varying similarity patterns at identical timesteps. In this paper, we present FlowCache, the first caching framework specifically designed for autoregressive video generation. Our key insight is that each video chunk should maintain an independent caching policy, allowing fine-grained control over which chunks require recomputation at each timestep. We introduce a chunkwise caching strategy that dynamically adapts to the unique denoising characteristics of each chunk, complemented by a joint importance-redundancy optimized KV cache compression mechanism that maintains fixed memory bounds while preserving generation quality. Our method achieves speedups of 2.38 times on MAGI-1 and 6.7 times on SkyReels-V2, with negligible quality degradation (VBench: 0.87 increase and 0.79 decrease, respectively). These results demonstrate that FlowCache unlocks the potential of autoregressive models for real-time, ultra-long video generation, establishing a new benchmark for efficient video synthesis at scale. The code is available at https://github.com/mikeallen39/FlowCache.
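The chunkwise policy described in the abstract can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes a TeaCache-style rule in which each chunk accumulates its own relative L1 distance and recomputes once the accumulated value reaches a threshold. The names `ChunkCacheState` and `decide_per_chunk` and the threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChunkCacheState:
    accumulated: float = 0.0   # relative L1 distance accumulated since last compute
    has_cache: bool = False    # False until the chunk has been computed once


def decide_per_chunk(states, distances, threshold):
    """Decide 'reuse' or 'compute' independently for every active chunk.

    `distances` maps chunk id -> relative L1 distance at the current step.
    `states` is updated in place, so each chunk keeps its own history.
    """
    decisions = {}
    for chunk_id, dist in distances.items():
        state = states.setdefault(chunk_id, ChunkCacheState())
        state.accumulated += dist
        if state.has_cache and state.accumulated < threshold:
            decisions[chunk_id] = "reuse"
        else:
            # A newly started chunk, or one that drifted too far, is recomputed.
            decisions[chunk_id] = "compute"
            state.accumulated = 0.0
            state.has_cache = True
    return decisions


states = {}
first = decide_per_chunk(states, {0: 0.05, 1: 0.30}, threshold=0.2)
second = decide_per_chunk(states, {0: 0.05, 1: 0.30}, threshold=0.2)
# first  -> both chunks computed (nothing cached yet)
# second -> chunk 0 reuses its cache, chunk 1 is recomputed
```

Keeping one state object per chunk is what lets two chunks at the same timestep receive different decisions, in contrast to a single policy shared by all chunks.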

Paper Structure (24 sections, 4 theorems, 28 equations, 5 figures, 6 tables)

Key Result

Theorem 1

Assume the diffusion model $v_{\theta}$ has converged to the optimal velocity field of the flow matching objective. Further, assume the scheduling function is of the power-law form $\sigma(t) = (t/T)^p$ for some constant power $p>0$ and total time $T$. Given $0 < t_{1} < t_{2}\leq T$ and a data chun

Figures (5)

  • Figure 1: Visual comparison of video generation quality among the vanilla model, TeaCache, and our FlowCache.
  • Figure 2: Adjacent-timestep relative L1 distance across denoising trajectories for autoregressive video diffusion models. Denoising progress (%) and relative L1 distance are shown on the horizontal and vertical axes, respectively, with colors denoting distinct video chunks. Results for (a) MAGI-1 and (b) SkyReels-V2 reveal three patterns: (i) a monotonic increase in relative L1 distance during denoising, confirming Theorem 1; (ii) substantial divergence of the metric among chunks at identical timesteps, which reflects heterogeneous denoising stages and yields variable reuse probabilities; (iii) persistently high similarity between model inputs and sampler outputs across all chunks.
  • Figure 3: Comparison of caching strategies in autoregressive video generation. The top part illustrates the Traditional Reuse strategy, which applies a uniform caching policy across all video chunks (i.e., all chunks at the same timestep share the same compute/reuse status, except for the newly initialized denoising chunk, which must be computed). In contrast, our FlowCache (bottom left) employs a chunkwise adaptive caching policy, dynamically deciding for each chunk whether to reuse cached features or perform recomputation based on its own relative L1 distance trajectory. The bottom right panel details FlowCache’s KV cache management: it maintains a Reserved KV Cache that selectively retains historically important and non-redundant tokens.
  • Figure 4: Qualitative results of text-to-video generation on MAGI-1. We present TeaCache, FlowCache-slow, FlowCache-fast, and the Vanilla model. The frames are randomly sampled from the generated video.
  • Figure 5: Qualitative results of text-to-video generation on SkyReels-V2. We present TeaCache, FlowCache-slow, FlowCache-fast, and the Vanilla model. The frames are randomly sampled from the generated video.
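The Figure 3 caption mentions a Reserved KV Cache that retains important and non-redundant tokens under a fixed budget. One way to picture such a selection is sketched below. It is an assumption-laden illustration, not the paper's algorithm: it swaps in a greedy rule that trades a per-token importance score against cosine similarity to the keys already kept. The names `select_tokens` and `redundancy_weight` and the toy scores are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def select_tokens(keys, importance, budget, redundancy_weight=0.5):
    """Greedily keep at most `budget` token indices.

    Each round picks the token with the highest
    importance - redundancy_weight * (max similarity to kept keys).
    """
    selected = []
    remaining = list(range(len(keys)))
    while remaining and len(selected) < budget:
        def score(i):
            redundancy = max(
                (cosine(keys[i], keys[j]) for j in selected), default=0.0
            )
            return importance[i] - redundancy_weight * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


# Tokens 0 and 1 have nearly identical keys; token 2 is less important
# but points in a different direction.
keys = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
importance = [0.9, 0.85, 0.5]
kept = select_tokens(keys, importance, budget=2)  # -> [0, 2]
```

Because the budget is fixed, the retained cache size stays bounded no matter how many chunks have been generated, which is the property the abstract attributes to the compression mechanism.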

Theorems & Definitions (6)

  • Theorem 1
  • Corollary 1
  • Theorem 1
  • Proof 1
  • Corollary 1: Cross-Chunk Divergence of Relative L1 Distance
  • Proof 2