Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion
Yongjia Ma, Junlin Chen, Donglin Di, Qi Xie, Lei Fan, Wei Chen, Xiaofei Gou, Na Zhao, Xun Yang
TL;DR
This work tackles the challenge of generating high-fidelity, coherent long videos without tuning or retraining. It introduces Global-Local Collaborative Denoising (GLCD) to model the long-video denoising trajectory through a global path that captures long-range dependencies and a local path that smooths frame-to-frame transitions, combined in a unified optimization. A Noise Reinitialization strategy boosts motion diversity and temporal alignment, while Video Motion Consistency Refinement (VMCR) uses gradient-based latent optimization to align both pixel- and frequency-domain motion cues. Empirically, the method, when plugged into a pre-trained short-video diffusion model like CogVideoX, extends generation from under 50 frames to over 1,000 frames with superior temporal coherence and visual fidelity, outperforming prior tuning-free long-video approaches. The approach demonstrates strong scalability and practical impact for production-quality long video synthesis without additional training requirements.
Abstract
Creating high-fidelity, coherent long videos is a sought-after aspiration. While recent video diffusion models have shown promising potential, they still grapple with spatiotemporal inconsistencies and high computational resource demands. We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing denoising trajectories through Global-Local Collaborative Denoising to ensure overall content consistency and temporal coherence between frames. Additionally, we introduce a Noise Reinitialization strategy which combines local noise shuffling with frequency fusion to improve global content consistency and visual diversity. Further, we propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses to enhance visual consistency and temporal smoothness. Extensive experiments, including quantitative and qualitative evaluations on videos of varying lengths (\textit{e.g.}, 3\times and 6\times longer), demonstrate that our method effectively integrates with existing video diffusion models, producing coherent, high-fidelity long videos superior to previous approaches.
