Brick-Diffusion: Generating Long Videos with Brick-to-Wall Denoising
Yunlong Yuan, Yuanfan Guo, Chunwei Wang, Hang Xu, Li Zhang
TL;DR
This work tackles long video generation with pre-trained short-video diffusion models in a training-free setting. It introduces Brick-Diffusion and its brick-to-wall denoising strategy, which slices the latent into short segments and denoises them, shifting the segment boundaries by a stride between iterations so that neighboring segments exchange information. The authors describe the procedures for latent slicing, offset computation, and parallelized denoising, and evaluate the method against several baselines on VBench, reporting improved overall fidelity and motion dynamics. The approach enables long-video synthesis of arbitrary length without fine-tuning, with practical implications for content creation and simulation.
Abstract
Recent advances in diffusion models have greatly improved text-driven video generation. However, training models for long video generation demands significant computational power and extensive data, leading most video diffusion models to be limited to a small number of frames. Existing training-free methods that attempt to generate long videos using pre-trained short video diffusion models often struggle with issues such as insufficient motion dynamics and degraded video fidelity. In this paper, we present Brick-Diffusion, a novel, training-free approach capable of generating long videos of arbitrary length. Our method introduces a brick-to-wall denoising strategy, where the latent is denoised in segments, with a stride applied in subsequent iterations. This process mimics the construction of a staggered brick wall, where each brick represents a denoised segment, enabling communication between frames and improving overall video quality. Through quantitative and qualitative evaluations, we demonstrate that Brick-Diffusion outperforms existing baseline methods in generating high-fidelity videos.
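The staggered segmentation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `brick_segments`, its parameters, the rule that the offset advances by one stride per denoising step, and the handling of shorter segments at the edges are all assumptions made for illustration. It only computes which frame ranges would be denoised together at a given step; the denoising itself is omitted.

```python
from typing import List, Tuple


def brick_segments(
    num_frames: int, segment_len: int, step: int, stride: int
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges that tile [0, num_frames) at one step.

    Hypothetical helper: the segment boundaries are shifted by
    (step * stride) % segment_len, so consecutive steps use staggered
    boundaries, like rows of bricks in a wall.
    """
    if num_frames <= 0 or segment_len <= 0:
        raise ValueError("num_frames and segment_len must be positive")

    # Assumed offset rule: advance by one stride per denoising step.
    offset = (step * stride) % segment_len

    # Interior boundaries start at the offset and repeat every segment_len.
    bounds = list(range(offset, num_frames, segment_len))
    if offset > 0:
        # A shorter leading segment covers the frames before the offset.
        bounds = [0] + bounds
    bounds.append(num_frames)

    # Pair consecutive boundaries into half-open frame ranges.
    return list(zip(bounds, bounds[1:]))


if __name__ == "__main__":
    for step in range(3):
        print(step, brick_segments(32, 16, step, 8))
```

With 32 frames, a segment length of 16, and a stride of 8, even steps give the ranges (0, 16) and (16, 32), while odd steps give (0, 8), (8, 24), and (24, 32). The middle segment of an odd step straddles the boundary used on even steps, which is how frames in different segments can influence one another across iterations.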
