MonarchRT: Efficient Attention for Real-Time Video Generation

Krish Agarwal, Zhuoming Chen, Cheng Luo, Yongqi Chen, Haizhong Zheng, Xun Huang, Atri Rudra, Beidi Chen

TL;DR

This work tackles the bottleneck of real-time video generation caused by the quadratic cost of 3D self-attention in diffusion transformers. It introduces MonarchRT, a structured Monarch-based attention parameterization with aligned blocks and a tiled extension, enabling highly expressive yet efficient attention without sacrificing quality. Through training-based finetuning and optimized Triton kernels, MonarchRT achieves up to 95% attention sparsity with negligible quality loss and substantial speedups over state-of-the-art FlashAttention variants, including true real-time 16 FPS generation on a single RTX 5090 for Self-Forcing. The approach significantly broadens the practicality of autoregressive video generation, offering strong performance gains for both autoregressive and bidirectional diffusion models and paving the way for deployment on consumer hardware.

Abstract

Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices. Through appropriately aligned block structure and our extended tiled Monarch parameterization, we achieve high expressivity while preserving computational efficiency. We further overcome the overhead of parameterization through finetuning, with custom Triton kernels. We first validate the high efficacy of Monarch-RT over existing sparse baselines designed only for bidirectional models. We further observe that Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing, making Monarch-RT a pioneering work on highly capable sparse attention parameterization for real-time video generation. Our optimized implementation outperforms FlashAttention-2, FlashAttention-3, and FlashAttention-4 kernels on NVIDIA RTX 5090, H100, and B200 GPUs, respectively, providing kernel speedups in the range of 1.4× to 11.8×. This enables us, for the first time, to achieve true real-time video generation with Self-Forcing at 16 FPS on a single RTX 5090.
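
To make the "oracle top-k" baseline in the abstract concrete, here is a minimal NumPy sketch that keeps the k largest entries in each row of a dense attention map and measures the mean squared error against the original. It illustrates the idea and is not the paper's evaluation code. The synthetic map (`toy_attention`, with its periodic positional bias), the helper names, and the choice not to renormalize the kept entries are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def oracle_topk(A, k):
    """Keep the k largest entries of each row of A and zero the rest.

    Assumption: the kept entries are not renormalized.
    """
    idx = np.argpartition(A, -k, axis=-1)[:, -k:]
    out = np.zeros_like(A)
    np.put_along_axis(out, idx, np.take_along_axis(A, idx, axis=-1), axis=-1)
    return out

def toy_attention(n=64, period=8, d=16, seed=0):
    """Hypothetical attention map: random QK^T logits plus a periodic positional bias."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((n, d))
    K = rng.standard_normal((n, d))
    pos = np.arange(n)
    # The bias is large whenever query and key positions agree modulo `period`.
    bias = 2.0 * ((pos[:, None] - pos[None, :]) % period == 0)
    return softmax(Q @ K.T / np.sqrt(d) + bias)

A = toy_attention()
for density in (0.05, 0.10, 0.25, 1.00):
    k = max(1, int(round(density * A.shape[1])))
    mse = np.mean((A - oracle_topk(A, k)) ** 2)
    print(f"density={density:.2f}  k={k:2d}  mse={mse:.3e}")
```

The error shrinks as k grows and vanishes only when every entry is kept. The toy map is not meant to reproduce the error levels reported for Self-Forcing.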

Paper Structure

This paper contains 41 sections, 2 theorems, 42 equations, 13 figures, 12 tables.

Key Result

Theorem 3.1

(informal) The 3D attention matrix $\boldsymbol{A} \in \mathbb{R}^{fhw \times fhw}$ defined above admits a structural decomposition in which $\boldsymbol{P}$ is a permutation matrix, $\boldsymbol{D'}$ is blockwise rank-1 with block sizes $(b_1, b_2)$ satisfying $b_1 b_2 = fhw$, and $\boldsymbol{S}$ is a sparse matrix.

Figures (13)

  • Figure 1: Left: MSE of oracle top-$k$ and Monarch parameterizations of an attention map compared to the original dense attention map for varying levels of sparsity. Results shown for two different layers/heads on Self-Forcing. Monarch incurs much lower error for high levels of sparsity. Right: Example generations on the same prompt on Self-Forcing. First row shows exact top-$k$ with 10% density still produces poor quality output. Third row shows that using inference-only MonarchAttention (with aligned block sizes) produces higher output quality with a lower parameter count of 8.6%. Second row shows that, although the parameter count is increased to 10.6%, inference-only MonarchAttention with misaligned block sizes incurs pixel-level permutation effects. Parameter count refers to the number of parameters used to estimate the full attention map.
  • Figure 2: Illustration of Regular and Tiled Monarch Parameterization. Top: An example of Monarch parameterization applied to a $12 \times 12$ matrix with block size $(b_1, b_2) = (3, 4)$. Bottom: An example of tiled Monarch parameterization applied with block size $(b_1, b_2) = (3, 2)$. (1) The original matrix is first permuted to expose an implicit block-wise low-rank structure. (2) After permutation, the matrix is reorganized into blocks of size $b_1 \times b_2$, where each block corresponds to a group of rows and columns. (3) Each block is then independently decomposed into low-rank factors. Overall, Monarch represents the matrix as $PLP^{\top}R$, where $P$ denotes a permutation matrix, $PLP^{\top}$ is a block-wise diagonal matrix, and $R$ is a block-diagonal matrix. The tiled Monarch parameterization has 2$\times$ the parameter count and is strictly more expressive than the regular Monarch parameterization.
  • Figure 3: Overview of the MonarchAttention pipeline. Given query and key matrices $(\boldsymbol{Q}, \boldsymbol{K})$, MonarchAttention iteratively refines the Monarch factors $\boldsymbol{L}$ and $\boldsymbol{R}$, each composed of sparse block-diagonal matrices. At each iteration, one factor is updated while the other is held fixed, without explicitly materializing the full attention matrix. Despite the highly structured and sparse parameterization of $\boldsymbol{L}$ and $\boldsymbol{R}$, the resulting attention matrix $\boldsymbol{A} \approx \boldsymbol{P}\boldsymbol{L}\boldsymbol{P}^{\top}\boldsymbol{R}$ is dense, highlighting the strong expressiveness of Monarch parameterization. Algorithmic details are provided in the paper's appendix.
  • Figure 4: Left: Percent of key-value tokens that fall under the top-$p$ threshold per head in Self-Forcing, averaged across all query tokens for several decoding iterations, on the first denoising iteration. Results shown for 5 randomly sampled heads/layers. Right: Example generations on the same prompt on Self-Forcing using MonarchRT with 10 and 1 iterations of iterative refinement. Using 10 iterations yields much higher quality but is inefficient in practice, so we instead recover the lost accuracy at 1 iteration through training.
  • Figure 5: An illustration of our modeling of the 3D attention map (the model equation in the paper). Left: the shape of the video. Right: an example attention map. The periodic diagonal bands arise from spatiotemporal positional structure, while the large activation at position $(8,2)$ reflects a semantic relationship that is independent of position, requiring dynamic (retrieval-based) sparse attention to capture.
  • ...and 8 more figures
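
The regular Monarch structure in the Figure 2 caption can be sketched in NumPy for the same $12 \times 12$ case with $(b_1, b_2) = (3, 4)$. This is an illustrative sketch, not the paper's Triton implementation. The index convention (which factor holds $b_1$ blocks and which holds $b_2$), the choice of $P$ as the reshape-transpose permutation, and all helper names are assumptions. The sketch checks that the explicit product $PLP^{\top}R$ matches a direct einsum construction and that every $b_1 \times b_2$ block, after regrouping rows and columns, has rank 1.

```python
import numpy as np

def monarch_dense(L, R):
    """Materialize M = P L P^T R from its block-diagonal factors.

    L: shape (b2, b1, b1), the b2 diagonal blocks of L.
    R: shape (b1, b2, b2), the b1 diagonal blocks of R.
    Entry M[(l, j), (k, i)] = L[j, l, k] * R[k, j, i]  (assumed convention).
    """
    b2, b1, _ = L.shape
    return np.einsum("jlk,kji->ljki", L, R).reshape(b1 * b2, b1 * b2)

def block_diag(blocks):
    """Place a stack of square blocks along the diagonal of a dense matrix."""
    n, b, _ = blocks.shape
    out = np.zeros((n * b, n * b))
    for t in range(n):
        out[t * b:(t + 1) * b, t * b:(t + 1) * b] = blocks[t]
    return out

def transpose_perm(b1, b2):
    """Permutation matrix mapping flat index j*b1 + l to l*b2 + j."""
    P = np.zeros((b1 * b2, b1 * b2))
    for l in range(b1):
        for j in range(b2):
            P[l * b2 + j, j * b1 + l] = 1.0
    return P

b1, b2 = 3, 4
rng = np.random.default_rng(0)
L = rng.standard_normal((b2, b1, b1))
R = rng.standard_normal((b1, b2, b2))
M = monarch_dense(L, R)

# The explicit product of permutation and block-diagonal factors agrees.
P = transpose_perm(b1, b2)
M_explicit = P @ block_diag(L) @ P.T @ block_diag(R)
assert np.allclose(M, M_explicit)

# After regrouping, each (j, k) block of size b1 x b2 is an outer product.
M4 = M.reshape(b1, b2, b1, b2)
block_ranks = [np.linalg.matrix_rank(M4[:, j, k, :])
               for j in range(b2) for k in range(b1)]

n_params = L.size + R.size   # b2*b1^2 + b1*b2^2 = 84
print(n_params, M.size)      # 84 parameters describe a dense 144-entry matrix
```

Under this convention the factors hold $N(b_1 + b_2)$ values for an $N \times N$ matrix with $N = b_1 b_2$, so the stored fraction is $(b_1 + b_2)/N$.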

Theorems & Definitions (2)

  • Theorem 3.1
  • Theorem 4.1: Strict expressiveness of tiled Monarch