KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study

Suraj Ranganath, Vaishak Menon, Anish Patnaik

Abstract

Self-forcing video generation extends a short-horizon video model to longer rollouts by repeatedly feeding generated content back in as context. This scaling path immediately exposes a systems bottleneck: the key-value (KV) cache grows with rollout length, so longer videos require not only better generation quality but also substantially better memory behavior. We present a comprehensive empirical study of KV-cache compression for self-forcing video generation on a Wan2.1-based Self-Forcing stack. Our study covers 33 quantization and cache-policy variants, 610 prompt-level observations, and 63 benchmark-level summaries across two evaluation settings: MovieGen for single-shot 10-second generation and StoryEval for longer narrative-style stability. We jointly evaluate peak VRAM, runtime, realized compression ratio, VBench imaging quality, BF16-referenced fidelity (SSIM, LPIPS, PSNR), and terminal drift. Three findings are robust. First, the strongest practical operating region is a FlowCache-inspired soft-prune INT4 adaptation, which reaches 5.42-5.49x compression while reducing peak VRAM from 19.28 GB to about 11.7 GB with only modest runtime overhead. Second, the highest-fidelity compressed methods, especially PRQ_INT4 and QUAROT_KV_INT4, are not the best deployment choices because they preserve quality at severe runtime or memory cost. Third, nominal compression alone is not sufficient: several methods shrink KV storage but still exceed BF16 peak VRAM because the current integration reconstructs or retains large BF16 buffers during attention and refresh stages. The result is a benchmark harness, analysis workflow, and empirical map of which KV-cache ideas are practical today and which are promising research directions for better memory integration. Code, data products, and the presentation dashboard are available at https://github.com/suraj-ranganath/kv-quant-longhorizon/.
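To make the abstract's central quantities concrete, the following is a minimal sketch of symmetric per-group INT4 quantization of a KV-cache tensor and of how a realized compression ratio against a BF16 baseline can be computed. This is an illustrative assumption, not the paper's implementation: the function names, the group size of 64, and the FP16-scale-per-group storage layout are all hypothetical, and the resulting ratio (about 3.8x for pure quantization) is deliberately below the paper's reported 5.42-5.49x, which additionally relies on cache pruning.

```python
import numpy as np

def quantize_kv_int4(kv, group=64):
    """Symmetric per-group INT4 quantization of a KV-cache tensor.
    Hypothetical illustration only; not the paper's implementation."""
    flat = kv.reshape(-1, group).astype(np.float32)
    # One scale per group, mapping the group's max magnitude to the INT4 range.
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0 + 1e-8
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_kv_int4(q, scale, shape):
    """Reconstruct an approximate BF16/FP32 view of the cache for attention."""
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(shape)

def realized_compression(shape, group=64):
    """Bytes of BF16 storage divided by bytes of INT4 payload plus scales."""
    n = int(np.prod(shape))
    bf16_bytes = 2 * n            # BF16 baseline: 2 bytes per element
    q_bytes = n / 2               # INT4 payload: two elements per byte
    scale_bytes = 2 * (n // group)  # one FP16 scale per group
    return bf16_bytes / (q_bytes + scale_bytes)
```

Under these assumptions, pure INT4 quantization with group size 64 yields 2N / (N/2 + 2N/64) ≈ 3.76x; the gap to the paper's 5.42-5.49x figure is what the FlowCache-inspired soft-prune policy contributes on top. The sketch also hints at the paper's third finding: `dequantize_kv_int4` materializes a full-precision buffer, so an integration that reconstructs the whole cache this way can exceed the BF16 peak despite the smaller stored payload.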

Paper Structure

This paper contains 19 sections, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: Global systems-quality landscape on MovieGen. FlowCache-inspired prune/soft-prune methods dominate the practical low-VRAM region, PRQ and QuaRot occupy the high-fidelity region, and spatially mixed methods collapse despite plausible motivation.
  • Figure 2: Global systems-quality landscape on StoryEval. The qualitative structure mirrors MovieGen: the practical winner remains in the FlowCache-inspired soft-prune region, while the highest-fidelity compressed methods sit at much higher runtime or peak-memory cost.
  • Figure 3: Pareto/frontier analysis on MovieGen. Methods that survive the quality-preserving compression frontier are not the same methods that survive the systems-efficiency frontier, which is why deployment winners and research-quality winners diverge.
  • Figure 4: Pareto/frontier analysis on StoryEval. The same separation persists under narrative-style rollout stability, reinforcing that the design-space conclusions are not artifacts of one benchmark.
  • Figure 5: Curated six-method qualitative comparisons for two MovieGen examples and one StoryEval example.
  • ...and 5 more figures