Towards Chunk-Wise Generation for Long Videos

Siyang Zhang, Ser-Nam Lim

TL;DR

This work tackles the memory bottleneck of generating long videos with diffusion models by adopting an autoregressive chunk-by-chunk strategy using pretrained I2V models. It reveals that initial noise quality critically influences per-chunk outputs and proposes a fast $k$-step noise evaluation and search to select favorable noises, mitigating error accumulation for smaller models. The approach is validated across multiple I2V backbones, showing that small models benefit significantly from $k$-step search while large models are already robust, enabling practical long-video generation. Overall, the paper demonstrates a training-free, scalable paradigm for extending diffusion-based video generation to longer sequences with improved consistency and efficiency.

Abstract

Generating long-duration videos has always been a significant challenge due to the inherent complexity of the spatio-temporal domain and the substantial GPU memory required to compute very large tensors. While diffusion-based generative models achieve state-of-the-art performance in video generation, they are typically trained with predefined video resolutions and lengths. During inference, a noise tensor with a specific resolution and length must be specified first, and the model then denoises the entire video tensor, all frames together, at once. Such an approach easily runs into out-of-memory (OOM) problems when the specified resolution and/or length exceeds a certain limit. One solution is to generate many short video chunks autoregressively, with strong inter-chunk spatio-temporal relations, and then concatenate them to form a long video. In this approach, a long video generation task is divided into multiple short video generation subtasks, and the cost of each subtask is reduced to a feasible level. In this paper, we conduct a detailed survey on long video generation with the autoregressive chunk-by-chunk strategy. We address common problems caused by applying short image-to-video models to long video tasks and design an efficient $k$-step search solution to mitigate these problems.
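The chunk-by-chunk strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_i2v` is a hypothetical stand-in for a pretrained I2V model, frames are represented as flat lists of floats, and the function and parameter names are assumptions made for illustration. The only point it shows is the control flow: each chunk is conditioned on the last frame generated so far, so only one short chunk is ever processed at a time.

```python
import random
from typing import Callable, List

Frame = List[float]  # toy stand-in for an image: a flat list of pixel values


def toy_i2v(guide: Frame, noise: List[Frame]) -> List[Frame]:
    """Hypothetical I2V model: each output frame is the guide plus scaled noise."""
    return [[g + 0.1 * n for g, n in zip(guide, noise_frame)] for noise_frame in noise]


def generate_long_video(
    guide: Frame,
    i2v: Callable[[Frame, List[Frame]], List[Frame]],
    num_chunks: int,
    frames_per_chunk: int,
    seed: int = 0,
) -> List[Frame]:
    """Generate a long video as a concatenation of short, autoregressive chunks."""
    rng = random.Random(seed)
    video: List[Frame] = []
    for _ in range(num_chunks):
        # Sample a fresh initial noise tensor sized for ONE short chunk only.
        noise = [[rng.gauss(0.0, 1.0) for _ in guide] for _ in range(frames_per_chunk)]
        chunk = i2v(guide, noise)
        video.extend(chunk)
        # The last generated frame becomes the guide image of the next chunk.
        guide = chunk[-1]
    return video


if __name__ == "__main__":
    frames = generate_long_video([0.0] * 4, toy_i2v, num_chunks=3, frames_per_chunk=5)
    print(len(frames))  # number of frames in the concatenated video
```

Because each chunk depends only on the previous chunk's last frame, any artifact in that frame is passed on to every later chunk, which is the error-accumulation problem the paper goes on to address.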

Paper Structure

This paper contains 19 sections, 4 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: An overview of the autoregressive chunk-by-chunk long video generation pipeline. At each step, an Image-to-Video (I2V) model takes a guide image as its condition and generates a short video chunk. The model then takes the last frame of the video collected so far as the new guide image and predicts the next chunk.
  • Figure 2: Examples of video quality degradation in chunk-by-chunk generation with StableVideoDiffusion [blattmann2023stablevideodiffusion] and ConsistI2V [ren2024consisti2v]. For each guide image, we perform naive chunk-by-chunk generation (top row) and $k$-step search generation (bottom row). The model introduces some artifacts in each chunk, and their accumulated effect eventually destroys the long video as the number of chunks increases. Our $k$-step search helps to mitigate the degradation.
  • Figure 3: Examples of long videos generated by OpenSoraPlanV1.3.0 and CogVideoX. For each guide image, we perform naive chunk-by-chunk generation (top row) and $k$-step search generation (bottom row). These models are more robust to initial noise.
  • Figure 4: $k$-step search: we first prepare $m$ initial noises and, for each of them, call the base I2V model to denoise for only $k$ steps, resulting in $m$ suboptimal short video candidates. We then explicitly evaluate the $m$ candidates and find the one with the best quality. Finally, we use the noise that led to the best candidate to perform full-step denoising.
  • Figure 5: Examples of sampling results with the same conditioning input but different initial noises. Rows 1-3 share the same conditioning input, as do rows 4-6 and rows 7-9.
  • ...and 1 more figure
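The $k$-step search summarized in the Figure 4 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: `toy_denoise` and `toy_quality` are hypothetical stand-ins for the I2V sampler and the video quality metric, frames are reduced to single floats, and all names and default values are invented for the example. What it shows is the selection logic: preview $m$ noises with $k$ denoising steps each, score the previews, then spend the full denoising budget only on the best noise.

```python
import random
from typing import Callable, List, Tuple

Video = List[float]  # toy stand-in: one float per frame


def toy_denoise(noise: Video, guide: float, steps: int, total_steps: int) -> Video:
    """Hypothetical sampler: more steps pull each frame closer to the guide."""
    t = 0.9 * steps / total_steps
    return [(1.0 - t) * n + t * guide for n in noise]


def toy_quality(video: Video) -> float:
    """Hypothetical metric: smoother videos (small frame differences) score higher."""
    return -sum(abs(a - b) for a, b in zip(video, video[1:]))


def k_step_search(
    guide: float,
    denoise: Callable[[Video, float, int, int], Video],
    quality: Callable[[Video], float],
    m: int,
    k: int,
    total_steps: int,
    num_frames: int,
    seed: int = 0,
) -> Tuple[Video, int, List[float]]:
    """Pick the best of m initial noises from k-step previews, then fully denoise it."""
    rng = random.Random(seed)
    # 1) Prepare m candidate initial noises.
    noises = [[rng.gauss(0.0, 1.0) for _ in range(num_frames)] for _ in range(m)]
    # 2) Denoise each candidate for only k steps to get a cheap preview.
    previews = [denoise(noise, guide, k, total_steps) for noise in noises]
    # 3) Score the m previews and keep the index of the best one.
    scores = [quality(preview) for preview in previews]
    best = max(range(m), key=lambda i: scores[i])
    # 4) Run full-step denoising from the selected noise only.
    chunk = denoise(noises[best], guide, total_steps, total_steps)
    return chunk, best, scores


if __name__ == "__main__":
    chunk, best, scores = k_step_search(
        0.5, toy_denoise, toy_quality, m=4, k=5, total_steps=50, num_frames=8
    )
    print(best, len(chunk))
```

In this sketch the search costs m·k preview steps plus one full run of `total_steps` per chunk, instead of m full runs, which is why a small k keeps the evaluation fast.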