Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

Jingyuan Chen, Fuchen Long, Jie An, Zhaofan Qiu, Ting Yao, Jiebo Luo, Tao Mei

TL;DR

Ouroboros-Diffusion tackles the challenge of tuning-free long video generation by strengthening both structural and subject-level consistency in a FIFO-based diffusion framework. It introduces coherent tail latent sampling to preserve layout while injecting motion, Subject-Aware Cross-Frame Attention (SACFA) to align subjects across frames, and self-recurrent guidance that leverages a subject memory bank to propagate information across time. Extensive VBench experiments show improvements in subject and background consistency, motion smoothness, and reduced temporal flickering, while maintaining dynamic motion. The approach demonstrates that leveraging cross-frame subject cues and memory-guided latent optimization can achieve more coherent long videos without extensive fine-tuning. This offers a practical, scalable path for high-quality, tuning-free long-video synthesis rooted in pre-trained diffusion models.

Abstract

First-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to maintain long-range temporal consistency in the generated videos because it does not model correspondence across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance, which leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive long video generation experiments on the VBench benchmark demonstrate the superiority of Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
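
The FIFO queue described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_denoise_step`, `fifo_generate`, and their parameters are hypothetical names, the "denoiser" is a placeholder that simply shrinks values, and the tail is filled with independent Gaussian noise as in plain FIFO-Diffusion rather than with the coherent tail latent sampling proposed in the paper. The sketch only shows the queue bookkeeping: noise levels rise from head to tail, every pass lowers each level by one, the clean head is dequeued, and fresh noise is enqueued at the tail.

```python
import random
from collections import deque


def toy_denoise_step(latent):
    """Hypothetical stand-in for one denoising step of a pre-trained model."""
    # A real model would predict and remove noise; here we just shrink values.
    return [0.5 * v for v in latent]


def fifo_generate(num_frames, queue_len=4, dim=3, seed=0):
    """Toy FIFO loop: returns the dequeued frames and the final noise levels."""
    rng = random.Random(seed)

    def gaussian_noise():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    # Each entry is (latent, noise_level); level 1 at the head, queue_len at the tail.
    # Assumption: the queue starts from pure noise, a simplification for this sketch.
    queue = deque((gaussian_noise(), level) for level in range(1, queue_len + 1))
    outputs = []
    while len(outputs) < num_frames:
        # One pass denoises every queued frame by a single noise level.
        queue = deque((toy_denoise_step(z), level - 1) for z, level in queue)
        # The head has now reached level 0, so it leaves the queue as a clean frame.
        clean, level = queue.popleft()
        assert level == 0
        outputs.append(clean)
        # Fresh Gaussian noise enters at the tail at the maximum noise level.
        queue.append((gaussian_noise(), queue_len))
    return outputs, [level for _, level in queue]


if __name__ == "__main__":
    frames, levels = fifo_generate(num_frames=5)
    print(len(frames), levels)
```

Because the queue length stays fixed while frames keep leaving at the head, the loop can run for any number of frames, which is what makes the scheme suitable for arbitrary-length generation.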

Paper Structure (30 sections, 6 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Illustration of FIFO-Diffusion (Kim et al., 2024) (top) and our Ouroboros-Diffusion (bottom) for tuning-free long video generation.
  • Figure 2: An overview of our Ouroboros-Diffusion. The whole framework (a) contains three key components: coherent tail latent sampling in the queue manager, (b) Subject-Aware Cross-Frame Attention (SACFA), and (c) self-recurrent guidance. Coherent tail latent sampling in the queue manager derives the frame latents enqueued at the queue tail to improve structural consistency. SACFA aligns subjects across frames within short segments for better visual coherence. Self-recurrent guidance leverages information from all historical cleaner frames to guide the denoising of noisier frames, fostering rich and contextual global information interaction.
  • Figure 3: The detailed illustration of coherent tail latent sampling in the queue manager.
  • Figure 4: Visual examples of single-scene long video generation by different approaches. The text prompt is "A cat wearing sunglasses and working as a lifeguard at a pool."
  • Figure 5: Visual examples of multi-scene long video generation by different approaches. The multi-scene prompts are: (1) an astronaut is riding a horse in space; (2) an astronaut is riding a dragon in space; (3) an astronaut is riding a motorcycle in space.