UFO: Enhancing Diffusion-Based Video Generation with a Uniform Frame Organizer

Delong Liu, Zhaohui Hou, Mingjie Zhan, Shihao Han, Zhicheng Zhao, Fei Su

TL;DR

Diffusion-based video generation still struggles with quality and consistency as sequences grow longer. The Uniform Frame Organizer (UFO) is a set of lightweight adapters that attach to a diffusion backbone and improve consistency non-destructively, with an adjustable intensity parameter and fast, resource-efficient training (≈3000 steps on a single GPU). Only the UFO parameters are updated, and a trained UFO transfers directly to other models of the same specification. UFO improves temporal and frame-wise quality (as measured by VBench) and supports stylization while preserving the original model's outputs. The approach is modular, transferable, and practical for building personalized, high-quality video generators with minimal retraining. The paper also discusses limitations and future work on automatic intensity adjustment.

Abstract

Recently, diffusion-based video generation models have achieved significant success. However, existing models often suffer from issues like weak consistency and declining image quality over time. To overcome these challenges, inspired by aesthetic principles, we propose a non-invasive plug-in called Uniform Frame Organizer (UFO), which is compatible with any diffusion-based video generation model. The UFO comprises a series of adaptive adapters with adjustable intensities, which can significantly enhance the consistency between the foreground and background of videos and improve image quality without altering the original model parameters when integrated. The training for UFO is simple, efficient, requires minimal resources, and supports stylized training. Its modular design allows for the combination of multiple UFOs, enabling the customization of personalized video generation models. Furthermore, the UFO also supports direct transferability across different models of the same specification without the need for specific retraining. The experimental results indicate that UFO effectively enhances video generation quality and demonstrates its superiority in public video generation benchmarks. The code will be publicly available at https://github.com/Delong-liu-bupt/UFO.
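
The abstract's description of adapters with adjustable intensity can be illustrated with a minimal sketch. It assumes that each adapter contributes an additive residual scaled by an intensity $\alpha$; that is an illustrative assumption, not the paper's confirmed formulation. The names `frozen_base_layer`, `make_adapter`, and `apply_with_adapters` are hypothetical and do not come from the UFO code base.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def frozen_base_layer(x: Vector) -> Vector:
    """Stand-in for one frozen layer of a pretrained generator (hypothetical)."""
    return [2.0 * v + 1.0 for v in x]


def make_adapter(scale: float) -> Layer:
    """Toy adapter (hypothetical): returns a residual to add to the base output."""
    return lambda x: [scale * v for v in x]


def apply_with_adapters(
    base: Layer,
    adapters: Sequence[Tuple[Layer, float]],
    x: Vector,
) -> Vector:
    """Return base(x) plus the sum of alpha-weighted adapter residuals.

    Assumption: adapters act additively. The base layer is never modified,
    so an intensity of 0 reproduces the original output.
    """
    out = base(x)
    for adapter, alpha in adapters:
        residual = adapter(x)
        out = [o + alpha * r for o, r in zip(out, residual)]
    return out


x = [1.0, -2.0, 0.5]
consistency = make_adapter(0.5)   # plays the role of a consistency UFO
style = make_adapter(-1.0)        # plays the role of a stylization UFO

original = frozen_base_layer(x)
zero_intensity = apply_with_adapters(frozen_base_layer, [(consistency, 0.0)], x)
combined = apply_with_adapters(
    frozen_base_layer, [(consistency, 0.1), (style, 1.0)], x
)
```

In this sketch, `zero_intensity` equals `original`, and `combined` stacks two adapters with separate intensities, mirroring the $\alpha=0.1$ and $\alpha=1$ settings mentioned in the Figure 4 caption.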

Paper Structure

This paper contains 18 sections, 3 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: The left side displays three cases: Cases 1 and 3 illustrate that our proposed consistency UFO can be integrated with the model to significantly enhance the consistency of generation. Case 2 demonstrates that the UFO can be directly transferred and effectively deployed between models of the same specification without the need for training. The cases on the right show that different consistency and stylization UFOs can be freely combined to customize video generators.
  • Figure 2: Training and inference of consistency UFO. During training, all parameters of the original model are frozen, and the UFO operates at maximum intensity using image-text pair data, with images duplicated across multiple frames to meet training requirements. In inference, zero intensity mirrors the original generator, while low intensity improves video consistency. The right images compare these two scenarios.
  • Figure 3: Visualizations of the consistency UFO. The areas highlighted in red boxes show inconsistencies or blurriness in the videos produced by the pre-trained model.
  • Figure 4: Examples of the effects of combining the consistency UFO with stylization UFOs. The first row shows results without a stylization UFO, while rows two to five show the effects of different stylization UFOs. Videos on the right add the consistency UFO to the configuration used for those on the left. In these cases, all stylization UFOs have $\alpha=1$ and the consistency UFO has $\alpha=0.1$; all videos are generated by Open at a resolution of $720 \times 1280$ and a duration of 4 seconds.
  • Figure 5: Metric variations with different $\alpha$ levels for the consistency UFO. Metrics in the left image are from a 4-second video at $576 \times 1008$ on Easy. Metrics in the right image are from a 4-second video at $720 \times 1280$ on Open. Light blue indicates the conservative strategy, orange the moderate strategy, and red the aggressive strategy.
  • ...and 7 more figures
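
The Figure 2 caption notes that training uses image-text pairs, with each image duplicated across multiple frames. A minimal sketch of that duplication step follows, assuming a grayscale image stored as nested lists for simplicity; the function name `image_to_static_clip` is hypothetical, and the actual data pipeline may differ.

```python
from typing import List

Image = List[List[float]]  # H x W grayscale image (a simplification)


def image_to_static_clip(image: Image, num_frames: int) -> List[Image]:
    """Repeat one image num_frames times to form a static "video" clip.

    Each frame is an independent copy, so later per-frame processing
    cannot accidentally modify the other frames.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return [[row[:] for row in image] for _ in range(num_frames)]


image = [[0.0, 1.0], [2.0, 3.0]]
clip = image_to_static_clip(image, num_frames=4)
```

Every frame in `clip` is identical, so the clip has no variation between frames by construction.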