LongDiff: Training-Free Long Video Generation in One Go
Zhuoling Li, Hossein Rahmani, Qiuhong Ke, Jun Liu
TL;DR
LongDiff generates long videos with off-the-shelf short-video diffusion models by addressing two core issues: temporal position ambiguity and information dilution. It is a training-free method with two components. Position Mapping (PM) groups the dense set of relative frame positions into $2G-1$ groups and then refines their ordering via SHIFT operations. Informative Frame Selection (IFS) constructs a pseudo-video to detect key frames and applies a masking scheme that restricts each frame's attention to its neighbors and the key frames. The approach is designed to be compatible with various relative positional encodings (e.g., RoPE) and is reported to deliver state-of-the-art performance on 128-frame videos across multiple baselines, with ablations showing that both PM and IFS are necessary. In practice, LongDiff enables long video generation that preserves temporal coherence and visual detail without any retraining.
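To make the two components more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a uniform bucketing rule for mapping relative positions into $2G-1$ groups (the summary does not specify the exact rule, and the SHIFT refinement is omitted), and it assumes the key-frame indices are already known rather than detected from a pseudo-video. The function names `group_relative_positions` and `neighbor_and_key_frame_mask`, and the `window` parameter, are hypothetical.

```python
import numpy as np


def group_relative_positions(num_frames: int, G: int) -> np.ndarray:
    """Map every relative offset i - j to a group index in [-(G-1), G-1].

    Assumption: magnitudes are bucketed uniformly with a ceiling, so only
    offset 0 lands in group 0. This yields at most 2G-1 distinct groups.
    """
    idx = np.arange(num_frames)
    rel = idx[:, None] - idx[None, :]            # offsets in [-(N-1), N-1]
    max_offset = max(num_frames - 1, 1)
    # Integer ceiling division avoids floating-point rounding issues.
    mag = (np.abs(rel) * (G - 1) + max_offset - 1) // max_offset
    return np.sign(rel) * mag


def neighbor_and_key_frame_mask(num_frames, key_frames, window):
    """Boolean [N, N] mask: True where query frame i may attend to frame j.

    A frame attends to frames within `window` of itself and to every key
    frame. Key-frame detection itself is assumed to happen elsewhere.
    """
    idx = np.arange(num_frames)
    near = np.abs(idx[:, None] - idx[None, :]) <= window
    is_key = np.zeros(num_frames, dtype=bool)
    is_key[np.asarray(key_frames, dtype=int)] = True
    return near | is_key[None, :]


if __name__ == "__main__":
    groups = group_relative_positions(num_frames=8, G=3)
    print(np.unique(groups))                     # distinct group indices
    mask = neighbor_and_key_frame_mask(8, key_frames=[0, 5], window=1)
    print(mask.astype(int))
```

In an attention layer, such a mask would typically be applied by setting the masked-out logits to a large negative value before the softmax, and the group indices would replace the raw relative offsets when computing the relative positional encoding.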
Abstract
Video diffusion models have recently achieved remarkable results in video generation. Despite their encouraging performance, most of these models are mainly designed and trained for short video generation, leading to challenges in maintaining temporal consistency and visual details in long video generation. In this paper, we propose LongDiff, a novel training-free method consisting of carefully designed components, Position Mapping (PM) and Informative Frame Selection (IFS), to tackle two key challenges that hinder short-to-long video generation generalization: temporal position ambiguity and information dilution. Our LongDiff unlocks the potential of off-the-shelf video diffusion models to achieve high-quality long video generation in one go. Extensive experiments demonstrate the efficacy of our method.
