Accelerating Text-to-Video Generation with Calibrated Sparse Attention

Shai Yehezkel, Shahar Yadin, Noam Elata, Yaron Ostrovsky-Berman, Bahjat Kawar

TL;DR

A significant fraction of token-to-token attention connections consistently yield negligible scores across inputs, and their patterns often repeat across queries, so the attention computation for these connections can be skipped with little to no effect on the result.

Abstract

Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
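
To make the inference-time idea concrete, here is a minimal NumPy sketch of attention that computes the selected key blocks densely and skips the rest. It is an illustration under stated assumptions, not the CalibAtt kernel: the function names, the boolean `block_mask` layout, and the Python loop are hypothetical, and the sequence length is assumed to be a multiple of `block_size`.

```python
import numpy as np

def dense_attention(q, k, v):
    """Standard softmax attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def block_sparse_attention(q, k, v, block_mask, block_size):
    """Attention that only visits key blocks flagged in block_mask.

    block_mask[i, j] is True when query block i attends to key block j
    (hypothetical layout; assumes lengths are multiples of block_size).
    """
    out = np.zeros((q.shape[0], v.shape[-1]), dtype=v.dtype)
    for i in range(block_mask.shape[0]):
        rows = slice(i * block_size, (i + 1) * block_size)
        kept = np.flatnonzero(block_mask[i])
        if kept.size == 0:
            continue  # nothing selected for this query block
        # Gather the token indices of every kept key block.
        cols = np.concatenate(
            [np.arange(j * block_size, (j + 1) * block_size) for j in kept]
        )
        # Softmax is taken over the kept keys only; skipped blocks
        # contribute nothing, which is harmless if their mass is negligible.
        out[rows] = dense_attention(q[rows], k[cols], v[cols])
    return out
```

With an all-True mask this reduces to dense attention; the saving comes from the fraction of blocks that are never touched. A real implementation would fuse this into a GPU kernel rather than loop in Python.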

Paper Structure (35 sections, 11 equations, 19 figures, 6 tables)

Figures (19)

  • Figure 1: Comparison of two prompts and resolutions generated with the same seed on Wan 2.1 14B text-to-video. CalibAtt achieves higher attention sparsity and lower end-to-end latency while maintaining visual quality and prompt alignment.
  • Figure 2: Attention patterns across timesteps ($t$), layers ($l$), and heads ($h$). We compare post-softmax attention maps (queries $\times$ keys) for different $t, l, h$ for the same prompt with Wan 2.1 14B [wan2025]. Each row fixes the attention granularity (token-level or block-level). For ease of visualization, we show the first $12544$ tokens out of the full sequence length of $32760$. Notably, the large block structure visible in some of the maps reflects intra-frame token correspondences in the video.
  • Figure 3: Data-independence of block sparsity. (a) Attention maps from layer 20, head 24, at timestep 10 across four different prompts, showing consistent sparsity patterns. (b) Histogram showing how often each block is marked to be kept across calibration prompts. A value of 0 means the block is skipped for all prompts, while 1 means it is always computed. Many blocks cluster near 0 or 1, indicating a largely data-independent sparsity pattern. The curve (purple) shows the cumulative fraction of blocks skipped under different agreement thresholds.
  • Figure 4: Spatial repetition within frames. (a) Token-level attention map at layer 30, head 24, timestep 0, showing two frames ($3120$ of $32760$ tokens). Frame boundaries are marked in red. (b) Zoomed-in slices from each frame-to-frame block with white grid lines separating spatial rows. The attention pattern repeats across spatial rows within each query frame.
  • Figure 5: Schematic description of CalibAtt. (a) We threshold the top key blocks per query block. Then, we aggregate the resulting masks across prompts and store them in a mask dictionary. In addition, we identify attention heads that exhibit spatial row repetition and store them in a dictionary. (b) At inference time, for non-repetitive heads (top), we load block masks into memory and skip the computation of the unset blocks accordingly. For heads flagged as spatially repetitive (bottom), we compute attention only for selected anchor rows per frame and broadcast the outputs to neighboring rows.
  • ...and 14 more figures
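
The calibration step described in the captions of Figures 3 and 5(a) can be sketched roughly as follows. This is a hedged illustration, not the paper's procedure: the selection rule (keep the fewest key blocks whose attention mass reaches a target `mass`), the `agreement` threshold on keep frequency, and all function names are assumptions introduced here, since the listing above does not specify them.

```python
import numpy as np

def block_mass(attn, block_size):
    """Pool a post-softmax map (queries x keys) into row-normalized block mass."""
    nq = attn.shape[0] // block_size
    nk = attn.shape[1] // block_size
    blocks = attn.reshape(nq, block_size, nk, block_size).sum(axis=(1, 3))
    return blocks / blocks.sum(axis=1, keepdims=True)

def top_block_mask(blocks, mass=0.9):
    """Per query block, keep the fewest key blocks whose mass reaches `mass`.

    Assumed selection rule, chosen for illustration only.
    """
    order = np.argsort(-blocks, axis=1)  # heaviest key blocks first
    sorted_mass = np.take_along_axis(blocks, order, axis=1)
    # A block is kept while the mass accumulated before it is below target.
    before = np.cumsum(sorted_mass, axis=1) - sorted_mass
    mask = np.zeros(blocks.shape, dtype=bool)
    np.put_along_axis(mask, order, before < mass, axis=1)
    return mask

def calibrate(attn_maps, block_size, mass=0.9, agreement=0.5):
    """Aggregate per-prompt masks for one (layer, head, timestep).

    Returns the final boolean mask and the per-block keep frequency,
    i.e. the quantity histogrammed in Figure 3(b).
    """
    masks = np.stack(
        [top_block_mask(block_mass(a, block_size), mass) for a in attn_maps]
    )
    keep_freq = masks.mean(axis=0)
    return keep_freq >= agreement, keep_freq

# Toy usage: three "prompts", two of which agree on the first key block.
a = np.tile([0.45, 0.45, 0.05, 0.05], (4, 1))
b = np.tile([0.05, 0.05, 0.45, 0.45], (4, 1))
mask, freq = calibrate([a, a, b], block_size=2, mass=0.8, agreement=0.5)
```

In this sketch the resulting mask would be stored in a dictionary keyed by layer, head, and timestep. Raising `agreement` skips more blocks at the cost of dropping connections that only some prompts need.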