Streaming Autoregressive Video Generation via Diagonal Distillation

Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, Weiyang Liu

TL;DR

Diagonal Distillation operates orthogonally to existing distillation approaches: it better exploits temporal information across both video chunks and denoising steps, which mitigates error propagation and reduces oversaturation in long-range sequences.

Abstract

Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.
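The headline numbers in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a reading aid: the function names are hypothetical, and the derived values (real-time factor, implied latency of the undistilled model) follow from the three figures quoted above rather than from any measurement reported separately.

```python
def realtime_factor(video_seconds: float, latency_seconds: float) -> float:
    """Seconds of video produced per second of wall-clock time."""
    return video_seconds / latency_seconds


def implied_baseline_latency(latency_seconds: float, speedup: float) -> float:
    """Latency of the undistilled model implied by the reported speedup."""
    return latency_seconds * speedup


# Values quoted in the abstract.
VIDEO_SECONDS = 5.0
LATENCY_SECONDS = 2.61
SPEEDUP = 277.3

rt = realtime_factor(VIDEO_SECONDS, LATENCY_SECONDS)
baseline = implied_baseline_latency(LATENCY_SECONDS, SPEEDUP)

print(f"real-time factor: {rt:.2f}x")
print(f"implied undistilled latency: {baseline:.1f} s ({baseline / 60:.1f} min)")
```

A real-time factor above 1 means the clip is generated faster than it plays back, which is the precondition for streaming use.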

Paper Structure (38 sections, 15 equations, 10 figures, 7 tables, 1 algorithm)

Figures (10)

  • Figure 1: Diagonal Distillation achieves comparable quality to the full-step model while significantly reducing latency. The method yields a 1.88× speedup on 5-second short video generation on a single H100 GPU.
  • Figure 2: When the training data uses explicit noisy frames as conditions, as in CausVid (yin2025slow), next-chunk prediction essentially functions as an implicit prediction of the next noise level. We observe that even with single-step prediction, the image progressively becomes clearer.
  • Figure 3: Diagonal Denoising with Diagonal Forcing and Progressive Step Reduction. We give an illustration of our method by starting with five denoising steps for the first chunk and gradually reducing them to two steps by Chunk 7. For chunks with $k \geq 4$, we use a fixed two-step denoising process, reusing the Key-Value (KV) cache from the final noisy frame of the preceding chunk. This design preserves temporal coherence while minimizing latency, and the corresponding pseudo-code is provided in the appendix.
  • Figure 4: Comparing the results from three different models. For more results, please refer to the Appendices.
  • Figure 5: Ablation study results. (a) Performance evaluation across different diagonal forcing timesteps, demonstrating optimal outcomes at 100 steps (1000 steps correspond to complete noise addition, while 0 steps represent the clean frame); (b) Impact of motion loss weight on model performance.
  • ...and 5 more figures
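
The progressive step reduction described in the Figure 3 caption can be illustrated with a small schedule function. This is a minimal sketch, not the authors' pseudo-code: the function name and the rule of dropping one step per chunk are assumptions, chosen only because they reproduce the two facts the caption states (five steps for the first chunk, a fixed two steps for chunks with $k \geq 4$).

```python
def diagonal_step_schedule(num_chunks: int, first_steps: int = 5, min_steps: int = 2) -> list[int]:
    """Denoising steps per chunk: start high, shrink by one per chunk, then hold.

    The one-step-per-chunk decrement is an illustrative assumption; the caption
    only fixes the starting count (5) and the floor (2 for chunks k >= 4).
    """
    if num_chunks < 0:
        raise ValueError("num_chunks must be non-negative")
    if min_steps < 1 or first_steps < min_steps:
        raise ValueError("require first_steps >= min_steps >= 1")
    # Position 0 of the list corresponds to chunk k = 1.
    return [max(min_steps, first_steps - i) for i in range(num_chunks)]


schedule = diagonal_step_schedule(7)
print(schedule)  # [5, 4, 3, 2, 2, 2, 2]

# Total denoising passes versus running five steps on every chunk.
print(sum(schedule), "vs", 5 * len(schedule))  # 20 vs 35
```

Under this assumed schedule, seven chunks need 20 denoising passes instead of 35, and the saving grows with video length because every additional chunk costs only the two-step floor.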