Fast Autoregressive Video Generation with Diagonal Decoding

Yang Ye, Junliang Guo, Haoyu Wu, Tianyu He, Tim Pearce, Tabish Rashid, Katja Hofmann, Jiang Bian

TL;DR

This work introduces Diagonal Decoding (DiagD), a training-free acceleration for autoregressive video generation that exploits spatial and temporal redundancies by generating tokens along diagonals in the spatio-temporal token grid. By enabling parallel token generation along diagonals within frames and overlapping across frames, DiagD achieves up to $10\times$ speedups with minimal fidelity loss, and it remains adaptable across diverse tokenizers and tasks, including video continuation and text-to-video. A lightweight finetuning strategy aligns attention patterns with the diagonal decoding order to close the training-inference gap, especially for smaller models. Extensive experiments on Cosmos, WHAM, and MC-AR demonstrate the method’s generality, robustness to hyperparameters, and practical impact for real-time or streaming video generation in autoregressive frameworks.

Abstract

Autoregressive Transformer models have demonstrated impressive performance in video generation, but their sequential token-by-token decoding process poses a major bottleneck, particularly for long videos represented by tens of thousands of tokens. In this paper, we propose Diagonal Decoding (DiagD), a training-free inference acceleration algorithm for autoregressively pre-trained models that exploits spatial and temporal correlations in videos. Our method generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame as well as partially overlapping across consecutive frames. The proposed algorithm is versatile and adaptive to various generative models and tasks, while providing flexible control over the trade-off between inference speed and visual quality. Furthermore, we propose a cost-effective finetuning strategy that aligns the attention patterns of the model with our decoding order, further mitigating the training-inference gap on small-scale models. Experiments on multiple autoregressive video generation models and datasets demonstrate that DiagD achieves up to $10\times$ speedup compared to naive sequential decoding, while maintaining comparable visual fidelity.
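The diagonal ordering described in the abstract can be illustrated with a small scheduling sketch. This is not the authors' implementation: it only groups token positions into parallel steps and involves no model. The parameter names `row_lag` and `frame_delay` are hypothetical and are not claimed to match the paper's hyperparameters $k$ and $d$. The sketch assumes each row trails the row above it by `row_lag` steps and each frame trails the previous frame by `frame_delay` steps.

```python
from collections import defaultdict


def diagonal_schedule(frames, height, width, row_lag=1, frame_delay=None):
    """Group token positions (t, i, j) into parallel decoding steps.

    Hypothetical parameters (illustrative, not taken from the paper):
      row_lag     - steps by which each row trails the row above it
      frame_delay - steps by which each frame trails the previous one;
                    defaults to one full frame, i.e. no temporal overlap
    """
    steps_per_frame = (height - 1) * row_lag + width
    if frame_delay is None:
        frame_delay = steps_per_frame

    steps = defaultdict(list)
    for t in range(frames):
        for i in range(height):
            for j in range(width):
                # Tokens that share a step index lie on the same diagonal
                # and would be produced together in one forward pass.
                steps[t * frame_delay + i * row_lag + j].append((t, i, j))
    return [steps[s] for s in sorted(steps)]


# Two 4x4 frames; the second frame starts 3 steps after the first.
schedule = diagonal_schedule(frames=2, height=4, width=4, row_lag=1, frame_delay=3)
sequential_steps = 2 * 4 * 4
print(len(schedule), "parallel steps vs", sequential_steps, "sequential steps")
# prints: 10 parallel steps vs 32 sequential steps
```

Under these assumptions the number of steps is `(frames - 1) * frame_delay + (height - 1) * row_lag + width`, so a smaller `row_lag` or `frame_delay` means fewer steps but less context available to each token. For example, with `row_lag=1` a token is scheduled in the same step as its upper-right neighbor and therefore cannot condition on it, whereas raster-order decoding would already have produced that neighbor. That difference is one plausible reading of the training-inference gap that the finetuning strategy is meant to reduce.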

Paper Structure

This paper contains 40 sections, 9 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Comparison between naive Next-Token Prediction (NTP) and Diagonal Decoding (DiagD) on Cosmos (Agarwal et al., 2025) autoregressive models. The subscript of DiagD indicates the choice of the hyperparameter $k$. DiagD achieves a $10\times$ speedup with little degradation in visual quality.
  • Figure 2: An illustration of the proposed Diagonal Decoding algorithm with $d=3$ and $k=1$. Spatially, tokens along the same diagonal within each frame are generated in parallel. Temporally, our method generates the top-left tokens of the subsequent frame before completing the current frame.
  • Figure 3: Human evaluation results for Cosmos-12B with DiagD and MC-AR 700M with or without DiagD finetuning. In the figure, "Win" indicates the left setting outperforms the right one, while "Lose" represents the opposite. The results indicate that DiagD achieves similar performance to NTP, and fine-tuning helps it perform even better.
  • Figure 4: Attention scores of the second frame in the Cosmos-4B model. The bright diagonal lines indicate that substantial attention scores are assigned to tokens at regular intervals, corresponding to those in temporally and spatially adjacent positions. The attention map shown is averaged over all self-attention layers in the model.
  • Figure 5: Qualitative analysis of Cosmos and WHAM. Videos generated by Cosmos-12B and 1.6B WHAM models using the next-token prediction paradigm (second row) and Diagonal Decoding under different configurations (bottom two rows). The ground truth is presented in the first row. We sample every 6 frames from the generated videos in Cosmos and every 8 frames from those in WHAM.
  • ...and 1 more figure