FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion
Zhuokun Chen, Jianfei Cai, Bohan Zhuang
TL;DR
The paper addresses the inefficiency of block diffusion in long-context generation. It observes that block-external attention is stable across diffusion steps and proposes FlashBlock, a caching mechanism that reuses external attention across steps while recomputing only block-internal attention. The method does not alter the diffusion process and can be combined with sparse attention as a residual reuse strategy, reducing attention computation and KV cache access. Experiments on diffusion language models and video diffusion show up to 1.44× higher token throughput and up to a 1.6× reduction in attention time, with negligible impact on generation quality. When paired with SparseD, accuracy improves at higher sparsity levels and the attention gap induced by sparsification is mitigated. The approach offers a practical path to scalable long-context diffusion, enabling faster inference for long-form text and video generation while preserving performance across benchmarks.
Abstract
Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined with it as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.
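
To make the caching idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the block-external attention output is cached together with its log-sum-exp (LSE) and merged with freshly computed block-internal attention using the standard LSE rule for combining partial softmax results. The abstract does not specify how the cached and recomputed parts are combined, so the merge rule, the function names (`partial_attention`, `merge`), the tensor sizes, and the size of the query perturbation are all illustrative assumptions.

```python
import numpy as np

def partial_attention(q, k, v):
    """Attention of q over one key/value segment.

    Returns the normalized output and the row-wise log-sum-exp (LSE) of the
    scaled scores, which is what is needed to merge segments later.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    m = scores.max(axis=-1, keepdims=True)   # subtract the max for stability
    w = np.exp(scores - m)
    s = w.sum(axis=-1, keepdims=True)
    return (w @ v) / s, m + np.log(s)

def merge(o_a, lse_a, o_b, lse_b):
    """Combine two partial results into attention over both segments."""
    lse = np.logaddexp(lse_a, lse_b)
    return np.exp(lse_a - lse) * o_a + np.exp(lse_b - lse) * o_b

rng = np.random.default_rng(0)
d, n_ext, n_blk = 16, 64, 8                  # hypothetical sizes

# Keys/values of tokens outside the current block (the existing KV cache).
k_ext = rng.normal(size=(n_ext, d))
v_ext = rng.normal(size=(n_ext, d))

# First diffusion step for this block: compute external attention once
# and cache its output and LSE (assumed caching scheme).
q0 = rng.normal(size=(n_blk, d))
cached_o_ext, cached_lse_ext = partial_attention(q0, k_ext, v_ext)

# A later diffusion step: the block's own queries/keys/values have changed.
q_t = q0 + 0.01 * rng.normal(size=(n_blk, d))
k_blk = rng.normal(size=(n_blk, d))
v_blk = rng.normal(size=(n_blk, d))

# Recompute only block-internal attention and reuse the cached external part.
o_int, lse_int = partial_attention(q_t, k_blk, v_blk)
approx = merge(cached_o_ext, cached_lse_ext, o_int, lse_int)

# Reference: full attention over external + internal tokens at this step.
full, _ = partial_attention(
    q_t, np.concatenate([k_ext, k_blk]), np.concatenate([v_ext, v_blk])
)

# Sanity check: with a freshly computed external part the merge is exact.
o_ext_fresh, lse_ext_fresh = partial_attention(q_t, k_ext, v_ext)
exact = merge(o_ext_fresh, lse_ext_fresh, o_int, lse_int)

print("max |exact - full|  =", np.abs(exact - full).max())
print("max |approx - full| =", np.abs(approx - full).max())
```

In this sketch the cached path touches only the block's own keys and values at later steps, which is where the savings in attention computation and KV cache access would come from. Its error relative to full attention depends entirely on how much the external attention output drifts across steps, which is the stability property the paper reports.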
