Brick-Diffusion: Generating Long Videos with Brick-to-Wall Denoising

Yunlong Yuan, Yuanfan Guo, Chunwei Wang, Hang Xu, Li Zhang

TL;DR

This work tackles long video generation with pre-trained short video diffusion models in a training-free setting. It introduces Brick-Diffusion and its brick-to-wall denoising strategy, which slices the latent into short segments, denoises each segment, and shifts the segment boundaries by a stride in subsequent denoising steps so that information passes between segments. The authors describe procedures for latent slicing, offset computation, and parallelized denoising, and evaluate the method against several baselines on VBench, reporting improved overall fidelity and motion dynamics. The approach generates long videos without fine-tuning, with practical implications for content creation and simulation.

Abstract

Recent advances in diffusion models have greatly improved text-driven video generation. However, training models for long video generation demands significant computational power and extensive data, leading most video diffusion models to be limited to a small number of frames. Existing training-free methods that attempt to generate long videos using pre-trained short video diffusion models often struggle with issues such as insufficient motion dynamics and degraded video fidelity. In this paper, we present Brick-Diffusion, a novel, training-free approach capable of generating long videos of arbitrary length. Our method introduces a brick-to-wall denoising strategy, where the latent is denoised in segments, with a stride applied in subsequent iterations. This process mimics the construction of a staggered brick wall, where each brick represents a denoised segment, enabling communication between frames and improving overall video quality. Through quantitative and qualitative evaluations, we demonstrate that Brick-Diffusion outperforms existing baseline methods in generating high-fidelity videos.
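
To make the staggered pattern concrete, here is a minimal Python sketch that computes segment boundaries for successive denoising steps. The function names, the modulo-based offset rule, and the shorter pieces at either end of the video are assumptions made for illustration; the abstract does not specify them.

```python
def brick_segments(num_frames, seg_len, offset):
    """Return [start, end) index pairs that cover frames 0..num_frames-1.

    The segment grid is shifted by `offset` frames. Pieces at either end
    may be shorter than seg_len (an assumption; the abstract does not say
    how boundaries are handled).
    """
    offset %= seg_len
    bounds = []
    start = 0
    end = offset if offset > 0 else seg_len
    while start < num_frames:
        end = min(end, num_frames)
        bounds.append((start, end))
        start, end = end, end + seg_len
    return bounds


def brick_schedule(num_frames, seg_len, stride, num_steps):
    """Segment layout per denoising step; the offset grows by `stride`
    each step and wraps around seg_len (hypothetical offset rule)."""
    return [
        brick_segments(num_frames, seg_len, (step * stride) % seg_len)
        for step in range(num_steps)
    ]


# 16 latent frames, segments of 8 frames, stride of 4 frames.
for step, layout in enumerate(brick_schedule(16, 8, 4, 3)):
    print(step, layout)
```

With a stride of half the segment length, the layout alternates between two grids, so the seams of one step fall in the middle of a segment in the next step, much like rows of bricks in a wall.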

Paper Structure (15 sections, 8 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Comparisons between different methods for long video generation. (a) Concatenation: denoises segments individually and concatenates them. (b) Sliding window: denoises segments in a sliding window approach. (c) Ours: uses the brick-to-wall denoising, generating videos with high fidelity.
  • Figure 2: The framework of Brick-Diffusion. For each denoising step, we slice the latent into segments and denoise them individually using a diffusion model. In the subsequent step, we apply a stride to shift and re-slice the latent into new segments. This process is repeated until we obtain the final clean latent.
  • Figure 3: Qualitative results of each method. The text prompt is "a cute raccoon playing guitar in a boat on the ocean." The method of directly concatenating clips results in dramatic content changes. For the other baseline methods, we use red boxes to highlight the issues present in the generated video frames.
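
The contrast between panels (a) and (c) of Figure 1 can be illustrated with a toy experiment. The sketch below is not the authors' implementation. Its "denoiser" is a hypothetical stand-in that replaces each segment with its mean, chosen only because it makes the frames inside a segment interact. The helper names and the offset rule are likewise assumptions. Segments are processed in a plain loop here, whereas a real pipeline would presumably batch them.

```python
def slice_bounds(num_frames, seg_len, offset):
    """[start, end) pairs covering all frames, with the grid shifted by offset."""
    offset %= seg_len
    bounds, start = [], 0
    end = offset or seg_len
    while start < num_frames:
        end = min(end, num_frames)
        bounds.append((start, end))
        start, end = end, end + seg_len
    return bounds


def toy_denoise(segment):
    """Stand-in for a short-video diffusion model (assumption): frames in
    one segment influence each other by collapsing to the segment mean."""
    mean = sum(segment) / len(segment)
    return [mean] * len(segment)


def run(latent, seg_len, stride, num_steps):
    """Slice, process each segment, then shift the grid by `stride`."""
    latent = list(latent)
    for step in range(num_steps):
        offset = (step * stride) % seg_len
        for start, end in slice_bounds(len(latent), seg_len, offset):
            latent[start:end] = toy_denoise(latent[start:end])
    return latent


def max_jump(latent):
    """Largest difference between neighbouring frames."""
    return max(abs(b - a) for a, b in zip(latent, latent[1:]))


frames = [float(i) for i in range(16)]
fixed = run(frames, seg_len=8, stride=0, num_steps=4)  # same slicing every step
brick = run(frames, seg_len=8, stride=4, num_steps=4)  # staggered slicing
print(max_jump(fixed), max_jump(brick))  # prints "8.0 2.0"
```

With fixed slicing, the two segments never exchange information, so the jump at the seam stays at 8.0. With staggered slicing, each seam is covered by a segment in the following step, and the largest jump shrinks to 2.0 after four steps. This mirrors, in a very simplified way, the abrupt content changes attributed to concatenation in Figure 3.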