FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion

Zhuokun Chen, Jianfei Cai, Bohan Zhuang

TL;DR

The paper targets the overhead of block diffusion in long-context generation, where every diffusion step recomputes attention over a growing KV cache. It observes that block-external attention outputs remain stable across diffusion steps and proposes FlashBlock, a caching mechanism that reuses external attention across steps while recomputing only block-internal attention. The method leaves the diffusion process unchanged and can be combined with sparse attention as a residual reuse strategy, reducing attention computation and KV cache access. Experiments on diffusion language models and video diffusion show up to 1.44× higher token throughput and up to a 1.6× reduction in attention time, with negligible impact on generation quality; when paired with SparseD, accuracy improves at higher sparsity levels and the attention gap induced by sparsification is mitigated. The approach offers a practical path to scalable long-context diffusion, enabling faster inference for long-form text and video generation while preserving performance across benchmarks.

Abstract

Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44$\times$ higher token throughput and up to 1.6$\times$ reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.

Paper Structure

This paper contains 14 sections, 11 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Cross-step stability of block-external vs. block-internal attention across diffusion steps. Visualization of attention similarity across diffusion steps for the same block at layer 3 of Trado-8B-Thinking. We compute the similarity of attention outputs between each diffusion step and its subsequent step within a denoising block. In each heatmap, the x-axis corresponds to token indices within the current block at step $s$, and the y-axis corresponds to token indices within the same block at step $s{+}1$; diagonal entries therefore represent the similarity of the same token across adjacent diffusion steps. The top row shows block-external attention ($A_{\text{out}}$), and the bottom row shows block-internal attention ($A_{\text{in}}$), across multiple diffusion steps. Brighter colors indicate higher similarity. Across steps, $A_{\text{out}}$ consistently exhibits higher similarity and more coherent structure, indicating strong cross-step stability, whereas $A_{\text{in}}$ varies substantially across steps.
  • Figure 2: Block-external attention caching for block diffusion. At each diffusion step, block diffusion updates a contiguous block of tokens. Our method caches attention contributions from block-external tokens and reuses them across steps, recomputing attention only within the current block. Block-internal and block-external attention are combined via log-space aggregation, reducing computation and memory I/O in long-context settings.
  • Figure 3: Per-step inference latency under increasing context length. We report results on Trado with batch size 128 using two A100 GPUs. Each column corresponds to a different updated-token threshold $\tau \in \{2,3,4\}$. Our method (orange) consistently reduces per-step inference latency compared to the Trado baseline (blue), with the gap widening as context length increases. Larger $\tau$ values enable more aggressive reuse of cached block-external attention, further reducing computation and memory access.
  • Figure 4: Qualitative comparison on video generation with LongLive-1.3B. We visualize video examples from VBench, each shown as six uniformly sampled frames. For each example, the top row shows results from the baseline, the middle row shows results from SpargeAttention, and the bottom row shows results from SpargeAttention combined with our block-external attention caching at a fixed sparsity ratio. Our method preserves visual quality and temporal consistency while improving inference efficiency, demonstrating that block-external attention caching does not introduce perceptible degradation in generated videos.
  • Figure 5: Attention similarity across diffusion steps in video diffusion models. We visualize the cosine similarity of attention outputs between adjacent diffusion steps for block-internal (orange) and block-external (blue) attention components across all layers and attention heads. Each subplot corresponds to one transformer layer, with the horizontal axis indexing attention heads.
  • ...and 5 more figures
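
The log-space aggregation mentioned in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `partial_attention` and `merge_partials`, the tensor sizes, and the perturbation used to mimic a later diffusion step are all assumptions made for illustration, and the updated-token threshold $\tau$ from Figure 3 is not modeled. The sketch relies only on the standard identity that softmax attention over two disjoint key sets can be recombined exactly from each part's normalized output and its log-sum-exp.

```python
import numpy as np

def partial_attention(q, k, v):
    """Softmax attention over one key/value subset.

    Returns the normalized output (n_q, d_v) and the row-wise
    log-sum-exp of the scaled scores (n_q,).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    row_max = scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores - row_max)
    denom = weights.sum(axis=-1, keepdims=True)
    out = (weights / denom) @ v
    lse = (row_max + np.log(denom)).squeeze(-1)
    return out, lse

def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results in log space."""
    lse = np.logaddexp(lse_a, lse_b)
    w_a = np.exp(lse_a - lse)[:, None]
    w_b = np.exp(lse_b - lse)[:, None]
    return w_a * out_a + w_b * out_b, lse

rng = np.random.default_rng(0)
n_ext, n_blk, d = 64, 4, 8          # hypothetical sizes
k_ext, v_ext = rng.normal(size=(n_ext, d)), rng.normal(size=(n_ext, d))
q = rng.normal(size=(n_blk, d))
k_in, v_in = rng.normal(size=(n_blk, d)), rng.normal(size=(n_blk, d))

# Step s: compute both parts; the external part is what would be cached.
cached_out_ext, cached_lse_ext = partial_attention(q, k_ext, v_ext)
out_in, lse_in = partial_attention(q, k_in, v_in)
merged, _ = merge_partials(cached_out_ext, cached_lse_ext, out_in, lse_in)

# Reference: attention over the concatenated keys and values.
full, _ = partial_attention(q, np.vstack([k_ext, k_in]), np.vstack([v_ext, v_in]))

# Step s+1 (assumed small change in the block): reuse the cached external
# part and recompute only the block-internal part.
q_next = q + 0.01 * rng.normal(size=q.shape)
k_in_next = k_in + 0.01 * rng.normal(size=k_in.shape)
v_in_next = v_in + 0.01 * rng.normal(size=v_in.shape)
out_in_next, lse_in_next = partial_attention(q_next, k_in_next, v_in_next)
approx, _ = merge_partials(cached_out_ext, cached_lse_ext, out_in_next, lse_in_next)
exact_next, _ = partial_attention(
    q_next, np.vstack([k_ext, k_in_next]), np.vstack([v_ext, v_in_next])
)
reuse_error = np.abs(approx - exact_next).max()
print("max abs error when reusing cached external attention:", reuse_error)
```

At step s the merge reproduces full attention up to floating-point error, so the split itself costs no accuracy. At step s+1 the result is an approximation whose error depends on how much the external contribution actually changed, which is the quantity the cross-step stability analysis in Figure 1 is concerned with. The saving comes from skipping the pass over the `n_ext` cached keys and values, which dominates as the context grows.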