VideoMerge: Towards Training-free Long Video Generation

Siyang Zhang, Harry Yang, Ser-Nam Lim

TL;DR

VideoMerge tackles the challenge of training-free long video generation by adapting pretrained text-to-short-video diffusion models. It introduces three complementary components: long noise initialization to preserve global identity while enabling dynamics, multi-tile latent fusion with sine-weighted overlap to smooth transitions, and prompt refining to enforce detailed identity constraints via large language models. Together, these strategies maintain expressive power and temporal coherence without additional training, outperforming state-of-the-art training-free baselines on identity preservation and motion consistency across human, animal, and landscape prompts. The approach offers a practical, resource-efficient solution for scalable long-video synthesis in real-world applications.

Abstract

Long video generation remains a challenging and compelling topic in computer vision. Diffusion-based models, among the various approaches to video generation, have achieved state-of-the-art quality with their iterative denoising procedures. However, the intrinsic complexity of the video domain renders the training of such diffusion models exceedingly expensive in terms of both data curation and computational resources. Moreover, these models typically operate on a fixed noise tensor that represents the video, resulting in predetermined spatial and temporal dimensions. Although several high-quality open-source pretrained video diffusion models, jointly trained on images and videos of varying lengths and resolutions, are available, it is generally not recommended to specify a video length at inference that was not included in the training set. Consequently, these models are not readily adaptable to the direct generation of longer videos by merely increasing the specified video length. In addition to feasibility challenges, long-video generation also encounters quality issues. The domain of long videos is inherently more complex than that of short videos: extended durations introduce greater variability and necessitate long-range temporal consistency, thereby increasing the overall difficulty of the task. We propose VideoMerge, a training-free method that can be seamlessly adapted to merge short videos generated by a pretrained text-to-video diffusion model. Our approach preserves the model's original expressiveness and consistency while allowing for extended duration and dynamic variation as specified by the user. By leveraging the strengths of pretrained models, our method addresses challenges related to smoothness, consistency, and dynamic content through orthogonal strategies that operate collaboratively to achieve superior quality.

Paper Structure

This paper contains 16 sections, 4 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: Comparison between our proposed VideoMerge and other state-of-the-art methods. Original prompt: "A woman is dancing in indoor garden." Our method preserves human identity consistently in terms of face, clothes, and hairstyle. We provide an extensive list of videos in the supplementary material that clearly demonstrates our superiority over current methods.
  • Figure 2: The weight assigned to each frame latent in a latent denoising tile follows a sine curve which allows smooth transition between adjacent tiles.
  • Figure 3: Original prompt: A tiger is walking inside a cage.
  • Figure 4: Original prompt: Beautiful scenery of flowing waterfalls and river.
  • Figure 5: Prompting to a large language model to enhance short and abstract text prompts with specific requirements.
  • ...and 7 more figures
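
The sine-weighted tile fusion described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sine_weights`, `tile_starts`, `fuse_tiles`), the half-sample offset in the sine curve, the normalization by summed weights, and the use of one scalar per frame in place of a latent tensor are all assumptions made for illustration.

```python
import math


def sine_weights(tile_len):
    """Per-frame weights inside one tile: low at both edges, highest in the middle.

    Assumed form: w_i = sin(pi * (i + 0.5) / tile_len). The half-sample offset
    keeps every weight strictly positive, so no frame is divided by zero later.
    """
    return [math.sin(math.pi * (i + 0.5) / tile_len) for i in range(tile_len)]


def tile_starts(total_len, tile_len, stride):
    """Start indices of overlapping tiles that together cover total_len frames.

    Assumes total_len >= tile_len and stride <= tile_len.
    """
    starts = list(range(0, total_len - tile_len + 1, stride))
    # Add one last tile flush with the end if the strided tiles fall short.
    if starts[-1] + tile_len < total_len:
        starts.append(total_len - tile_len)
    return starts


def fuse_tiles(tiles, starts, total_len):
    """Blend per-tile frame values into one long sequence.

    Each tile is a list of per-frame values (scalars here, standing in for
    frame latents). Overlapping frames are averaged with the sine weights,
    normalized by the total weight each frame received.
    """
    weights = sine_weights(len(tiles[0]))
    weighted_sum = [0.0] * total_len
    weight_total = [0.0] * total_len
    for tile, start in zip(tiles, starts):
        for i, value in enumerate(tile):
            weighted_sum[start + i] += weights[i] * value
            weight_total[start + i] += weights[i]
    return [s / w for s, w in zip(weighted_sum, weight_total)]


# Two 4-frame tiles overlapping by 2 frames: the first is all 0.0, the second all 1.0.
fused = fuse_tiles([[0.0] * 4, [1.0] * 4], starts=[0, 2], total_len=6)
print(fused)
```

In the overlap, the first tile's weight falls while the second tile's weight rises, so the fused values move gradually from one tile to the other instead of switching abruptly at a hard boundary.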