Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion

Yongjia Ma, Junlin Chen, Donglin Di, Qi Xie, Lei Fan, Wei Chen, Xiaofei Gou, Na Zhao, Xun Yang

TL;DR

This work tackles the challenge of generating high-fidelity, coherent long videos without tuning or retraining. It introduces Global-Local Collaborative Denoising (GLCD) to model the long-video denoising trajectory through a global path that captures long-range dependencies and a local path that smooths frame-to-frame transitions, combined in a unified optimization. A Noise Reinitialization strategy boosts motion diversity and temporal alignment, while Video Motion Consistency Refinement (VMCR) uses gradient-based latent optimization to align both pixel- and frequency-domain motion cues. Empirically, the method, when plugged into a pre-trained short-video diffusion model like CogVideoX, extends generation from under 50 frames to over 1,000 frames with superior temporal coherence and visual fidelity, outperforming prior tuning-free long-video approaches. The approach demonstrates strong scalability and practical impact for production-quality long video synthesis without additional training requirements.

Abstract

Creating high-fidelity, coherent long videos is a sought-after aspiration. While recent video diffusion models have shown promising potential, they still grapple with spatiotemporal inconsistencies and high computational resource demands. We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing denoising trajectories through Global-Local Collaborative Denoising to ensure overall content consistency and temporal coherence between frames. Additionally, we introduce a Noise Reinitialization strategy which combines local noise shuffling with frequency fusion to improve global content consistency and visual diversity. Further, we propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses to enhance visual consistency and temporal smoothness. Extensive experiments, including quantitative and qualitative evaluations on videos of varying lengths (e.g., 3× and 6× longer), demonstrate that our method effectively integrates with existing video diffusion models, producing coherent, high-fidelity long videos superior to previous approaches.
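
To make "local noise shuffling with frequency fusion" concrete, here is a minimal NumPy sketch. The abstract does not spell out the procedure, so every detail is an assumption made for illustration: the tiling of a short base noise clip, the window size, the temporal cutoff, and the function names `shuffle_within_windows`, `temporal_frequency_fusion`, and `reinitialize_noise` are all hypothetical. The sketch keeps the low temporal frequencies of the shuffled noise (shared content) and takes the remaining frequencies from fresh Gaussian noise (diversity).

```python
import numpy as np

def shuffle_within_windows(noise, window, rng):
    """Permute frames independently inside each consecutive window along axis 0."""
    out = noise.copy()
    for start in range(0, noise.shape[0], window):
        stop = min(start + window, noise.shape[0])
        out[start:stop] = noise[start + rng.permutation(stop - start)]
    return out

def temporal_frequency_fusion(low_src, high_src, cutoff):
    """Take temporal frequencies |f| <= cutoff from low_src, the rest from high_src."""
    n = low_src.shape[0]
    freqs = np.abs(np.fft.fftfreq(n))  # cycles per frame, in [0, 0.5]
    mask = (freqs <= cutoff).reshape(-1, *([1] * (low_src.ndim - 1)))
    fused = np.fft.fft(low_src, axis=0) * mask + np.fft.fft(high_src, axis=0) * (~mask)
    # Both inputs are real and the mask is symmetric in f, so the result is real
    # up to rounding error.
    return np.fft.ifft(fused, axis=0).real

def reinitialize_noise(base_noise, num_frames, window, cutoff, seed=0):
    """Assumed pipeline: tile short noise, shuffle locally, fuse with fresh noise."""
    rng = np.random.default_rng(seed)
    reps = -(-num_frames // base_noise.shape[0])  # ceiling division
    tiled = np.concatenate([base_noise] * reps, axis=0)[:num_frames]
    shuffled = shuffle_within_windows(tiled, window, rng)
    fresh = rng.standard_normal(shuffled.shape)
    return temporal_frequency_fusion(shuffled, fresh, cutoff)
```

For example, `reinitialize_noise(base, num_frames=12, window=4, cutoff=0.2)` with a `(4, C, H, W)` base returns a `(12, C, H, W)` array. Raising `cutoff` keeps more of the shuffled noise, and lowering it injects more fresh noise.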

Paper Structure (35 sections, 16 equations, 12 figures, 6 tables)

Figures (12)

  • Figure 1: Overview of our GLC Diffusion. It illustrates the denoising process from $z_t$ to $z_{t-1}$, integrating our proposed modules: Global-Local Collaborative Denoising (GLCD), Local Random Shifting Sampling (LRSS), Attention-Based Adaptive Modulation (ABAM), and Video Motion Consistency Refinement (VMCR). GLCD consists of global and local denoising paths to maintain overall content consistency and enhance local temporal coherence. LRSS improves spatio-temporal coherence by sampling local frames with random shifts. ABAM adaptively modulates attention weights to emphasize important regions, while VMCR enforces motion consistency across frames.
  • Figure 2: Illustration of the Video Motion Consistency Refinement (VMCR) module. The VMCR module minimizes both pixel-wise loss and frequency-wise loss, aligning motion predictions between frames to enhance visual consistency and temporal smoothness in the generated video.
  • Figure 3: Qualitative comparison of long video generation methods with varying lengths (3× and 6×). Visual comparisons are presented for Direct Sampling, FreeLong, GenL, and FreeNoise in order. Direct Sampling and FreeLong produce overly smooth videos with noticeable quality degradation, especially for 6× length, where the visual quality is poor and details are lost. GenL and FreeNoise show improvements in temporal coherence but still suffer from artifacts and significant detail loss. In contrast, our GLC Diffusion consistently generates high-quality videos with smooth motion and consistent content across both 3× and 6× lengths, effectively preserving crucial details and textures.
  • Figure 4: Ablation Study on GLC Diffusion Components. We analyze the impact of each component in our method by conducting ablation experiments: (a) w/o GLCD, (b) w/o global path, (c) w/o local path, (d) w/o Noise Reinit, (e) w/o VMCR, and (f) Ours.
  • Figure 5: Qualitative Results of Annealing Coefficient $\gamma$ in GLCD.
  • ...and 7 more figures
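
The Figure 2 caption describes VMCR as minimizing a pixel-wise and a frequency-wise loss through gradients. The sketch below shows one way such an objective and its gradient step could look in NumPy. It is not the paper's formulation: "motion" is assumed here to be the difference between neighbouring frames, the frequency term is an assumed radially weighted spatial spectrum, the optimization runs on a plain `(T, H, W)` array rather than on diffusion latents, and all function names are hypothetical.

```python
import numpy as np

def radial_weight(h, w):
    """Assumed frequency weighting: 0 at DC, 1 at the highest spatial frequency."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    rad = np.sqrt(fy ** 2 + fx ** 2)
    return rad / rad.max()

def motion_residual(frames):
    """Second temporal difference: change in motion between neighbouring frame pairs."""
    return frames[2:] - 2.0 * frames[1:-1] + frames[:-2]

def motion_residual_adjoint(g, num_frames):
    """Transpose of motion_residual, mapping residual gradients back to frames."""
    out = np.zeros((num_frames,) + g.shape[1:])
    out[2:] += g
    out[1:-1] -= 2.0 * g
    out[:-2] += g
    return out

def loss_and_grad(frames, freq_weight, lam=1.0):
    """Return 0.5*sum(r^2) + lam*0.5*sum(w*|FFT(r)|^2) and its gradient w.r.t. frames."""
    r = motion_residual(frames)
    spec = np.fft.fft2(r, norm="ortho")  # unitary 2D FFT over the last two axes
    pixel = 0.5 * np.sum(r ** 2)
    freq = 0.5 * np.sum(freq_weight * np.abs(spec) ** 2)
    # Gradient w.r.t. r: pixel term gives r, frequency term gives Re(F^H W F r).
    g_r = r + lam * np.fft.ifft2(freq_weight * spec, norm="ortho").real
    return pixel + lam * freq, motion_residual_adjoint(g_r, frames.shape[0])

def refine(frames, freq_weight, steps=50, lr=0.02, lam=1.0):
    """Plain gradient descent on the combined loss (a stand-in for latent optimization)."""
    x = np.array(frames, dtype=float)
    for _ in range(steps):
        _, g = loss_and_grad(x, freq_weight, lam)
        x -= lr * g
    return x
```

With weights in [0, 1] and `lam=1.0`, the step size `lr=0.02` stays below the stability bound of this quadratic objective, so each step lowers the loss. Run to convergence, the objective would flatten motion toward linear interpolation, so a sketch like this only makes sense as a few small corrective steps.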