BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, Bohan Zhuang

TL;DR

BlockVid addresses the challenge of generating minute-long videos by uniting semi-autoregressive block diffusion with a semantic sparse KV cache, Block Forcing training, and chunk-aware noise scheduling. The method mitigates chunk-wise error propagation and enhances long-range temporal coherence, while LV-Bench provides a granular benchmark for coherence over extended durations. Empirical results on LV-Bench and VBench show BlockVid achieving state-of-the-art performance across coherence and perceptual quality metrics, including notable improvements in VDE Subject and VDE Clarity. The work advances practical long-form video generation and offers a robust evaluation framework for future research in world-model–level video synthesis.

Abstract

Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two persistent challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with a semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over state-of-the-art approaches. Project website: https://ziplab.co/BlockVid. Inferix (Code): https://github.com/alibaba-damo-academy/Inferix.

Paper Structure

This paper contains 36 sections, 26 equations, 9 figures, 7 tables, 2 algorithms.

Figures (9)

  • Figure 1: Architecture comparison: AR vs. Diffusion vs. Block Diffusion (Semi-AR). Our BlockVid aims to tackle the chunk-wise accumulation error of block diffusion, enabling high-fidelity and coherent minute-long video generation.
  • Figure 2: Comparison of visualization results between our method and different baselines in terms of accumulation error. Details can be found in the appendix.
  • Figure 3: Overview of the BlockVid semi-AR framework. The generation of chunk $c+1$ is conditioned on both a local KV cache and a globally retrieved context. The global context is dynamically assembled by retrieving top-$l$ semantically similar KV chunks via prompt embedding similarity. Upon generation, the bank is updated with the new chunk's most salient KV tokens.
  • Figure 4: More visualization results #1.
  • Figure 5: More visualization results #2.
  • ...and 4 more figures
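
The retrieval step described in the Figure 3 caption (selecting the top-$l$ stored chunks by prompt embedding similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: cosine similarity is assumed as the similarity measure, the bank is assumed to store one prompt embedding per chunk, and the names `cosine`, `retrieve_top_l`, `bank_keys`, and `query` are hypothetical. The bank update with salient KV tokens is not modeled here.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_l(
    query: Sequence[float],
    bank_keys: List[Sequence[float]],
    top_l: int,
) -> List[int]:
    """Return indices of the top_l bank entries most similar to the query, best first.

    Ties are broken by the lower (older) chunk index; this is an arbitrary choice.
    """
    scored = [(cosine(query, key), idx) for idx, key in enumerate(bank_keys)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_l]]


# Hypothetical bank: one prompt embedding per previously generated chunk.
bank_keys = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
# Hypothetical prompt embedding for the chunk about to be generated.
query = [1.0, 0.05, 0.0]

print(retrieve_top_l(query, bank_keys, top_l=2))  # [0, 2]
```

In a full system the returned indices would select cached key/value tensors to attend over alongside the local KV cache; here they only identify which stored chunks are semantically closest to the current prompt.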