Fast Autoregressive Video Generation with Diagonal Decoding
Yang Ye, Junliang Guo, Haoyu Wu, Tianyu He, Tim Pearce, Tabish Rashid, Katja Hofmann, Jiang Bian
TL;DR
This work introduces Diagonal Decoding (DiagD), a training-free acceleration method for autoregressive video generation that exploits spatial and temporal redundancies by generating tokens along diagonals in the spatio-temporal token grid. By generating tokens in parallel along diagonals within each frame and overlapping decoding across consecutive frames, DiagD achieves up to $10\times$ speedup with minimal fidelity loss, and it remains adaptable across diverse tokenizers and tasks, including video continuation and text-to-video generation. A lightweight finetuning strategy aligns attention patterns with the diagonal decoding order to close the training-inference gap, especially for smaller models. Experiments on Cosmos, WHAM, and MC-AR demonstrate the method’s generality, its robustness to hyperparameters, and its practical value for real-time or streaming video generation in autoregressive frameworks.
Abstract
Autoregressive Transformer models have demonstrated impressive performance in video generation, but their sequential token-by-token decoding process poses a major bottleneck, particularly for long videos represented by tens of thousands of tokens. In this paper, we propose Diagonal Decoding (DiagD), a training-free inference acceleration algorithm for autoregressively pre-trained models that exploits spatial and temporal correlations in videos. Our method generates tokens along diagonal paths in the spatio-temporal token grid, enabling parallel decoding within each frame as well as partially overlapped decoding across consecutive frames. The proposed algorithm is versatile and adaptable to various generative models and tasks, while providing flexible control over the trade-off between inference speed and visual quality. Furthermore, we propose a cost-effective finetuning strategy that aligns the attention patterns of the model with our decoding order, further mitigating the training-inference gap on small-scale models. Experiments on multiple autoregressive video generation models and datasets demonstrate that DiagD achieves up to $10\times$ speedup compared to naive sequential decoding, while maintaining comparable visual fidelity.
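To make the decoding order concrete, the following is a minimal sketch of how token positions in a spatio-temporal grid could be grouped into parallel decoding steps along diagonals. It is an illustration under stated assumptions, not the authors' implementation: the function name `diagonal_schedule` and the parameters `row_lag` (how many steps each row trails the row above it) and `frame_lag` (how many steps each frame trails the previous one) are hypothetical names introduced here.

```python
from collections import defaultdict


def diagonal_schedule(height, width, num_frames=1, row_lag=1, frame_lag=None):
    """Group token positions (frame, row, col) into parallel decoding steps.

    Hypothetical helper: a token at (t, r, c) is assigned to step
    t * frame_lag + r * row_lag + c, so tokens sharing a step lie on one
    diagonal and could be decoded in the same forward pass.
    """
    steps_per_frame = (width - 1) + row_lag * (height - 1) + 1
    if frame_lag is None:
        # Assumed default: no overlap, a frame starts once the previous one ends.
        frame_lag = steps_per_frame

    steps = defaultdict(list)
    for t in range(num_frames):
        for r in range(height):
            for c in range(width):
                steps[t * frame_lag + r * row_lag + c].append((t, r, c))

    # Return the groups in decoding order.
    return [steps[s] for s in sorted(steps)]


if __name__ == "__main__":
    single = diagonal_schedule(height=4, width=6)
    print(len(single), "parallel steps for", 4 * 6, "tokens")

    overlapped = diagonal_schedule(height=4, width=6, num_frames=2, frame_lag=4)
    print(len(overlapped), "parallel steps for", 2 * 4 * 6, "tokens")
```

In this toy setting, a single 4 by 6 frame takes 9 parallel steps instead of 24 sequential ones, and two frames with `frame_lag=4` take 13 steps instead of 48. In this sketch, every token is still scheduled after its left neighbor and the token directly above it, but before some tokens that precede it in raster order, so the model conditions on a different context than in sequential decoding. Larger lags reduce parallelism and bring the order closer to sequential decoding, which is one way to picture the speed versus quality trade-off described above.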
