LongDiff: Training-Free Long Video Generation in One Go

Zhuoling Li, Hossein Rahmani, Qiuhong Ke, Jun Liu

TL;DR

LongDiff tackles the challenge of generating long videos with off-the-shelf short-video diffusion models by addressing two core issues: temporal position ambiguity and information dilution. It introduces a training-free solution with two components: Position Mapping (PM), which groups a dense set of relative frame positions into $2G-1$ groups and then refines ordering via SHIFT operations, and Informative Frame Selection (IFS), which constructs a pseudo-video to detect key frames and applies a masking scheme that restricts attention to neighbors and key frames. The approach is designed to be compatible with various relative positional encodings (e.g., RoPE) and is demonstrated to deliver state-of-the-art performance on 128-frame long videos across multiple baselines, with extensive ablations showing the necessity of both PM and IFS. Practically, LongDiff enables training-free long video generation with preserved temporal coherence and rich visual details, offering a scalable path for high-quality long-form video synthesis without retraining.
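The IFS masking idea described above (each query frame attends only to its neighbors and to the detected key frames) can be illustrated with a minimal sketch. The function name `ifs_mask`, the `window` parameter, and the example key-frame indices are assumptions made for illustration. The pseudo-video step that detects key frames is not reproduced here, so the key frames are simply passed in.

```python
def ifs_mask(num_frames, key_frames, window):
    """Boolean attention mask: mask[q][k] is True when query frame q may attend to key frame k.

    Hypothetical helper, not the authors' implementation. A key frame k is allowed if it
    lies within `window` frames of the query (a neighbor) or is in `key_frames`.
    """
    keys = set(key_frames)
    return [
        [abs(q - k) <= window or k in keys for k in range(num_frames)]
        for q in range(num_frames)
    ]


# Example values are made up: 8 frames, frames 0 and 5 treated as key frames, window of 1.
mask = ifs_mask(num_frames=8, key_frames=[0, 5], window=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row is one query frame. The band around the diagonal comes from the neighbor window, and the two full columns come from the key frames, which every query can see regardless of distance.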

Abstract

Video diffusion models have recently achieved remarkable results in video generation. Despite their encouraging performance, most of these models are mainly designed and trained for short video generation, leading to challenges in maintaining temporal consistency and visual details in long video generation. In this paper, we propose LongDiff, a novel training-free method consisting of two carefully designed components, Position Mapping (PM) and Informative Frame Selection (IFS), to tackle two key challenges that hinder short-to-long video generation generalization: temporal position ambiguity and information dilution. Our LongDiff unlocks the potential of off-the-shelf video diffusion models to achieve high-quality long video generation in one go. Extensive experiments demonstrate the efficacy of our method.

Paper Structure

This paper contains 29 sections, 5 theorems, 19 equations, 10 figures, 14 tables.

Key Result

Theorem 1

Define the attention logit function in temporal attention as $f(\mathbf{q}, \mathbf{k}, p)$, which maps the query frame $\mathbf{q}$, key frame $\mathbf{k}$, and their relative position $p$ to a scalar value. Consider a video generation task with $N$ frames, where the model categorizes the $2N-1$ ... where $r$ is the pseudo-dimension of the function class $\mathcal{H} = \{f(\cdot, \cdot, p) \mid p \dots$

Figures (10)

  • Figure 1: Results of directly applying short-video diffusion models (LaVie [wang2023lavie] and VideoCrafter [chen2023videocrafter1]) for long video generation. Since the spatial transformer layers in these short-video models operate independently of video length, and the temporal transformer layers can process input sequences of various lengths, we only need to extend the length of the noise sequence used as the starting point for denoising to achieve long video generation. Although the short videos have good quality, the long videos exhibit inferior temporal consistency (marked by the red boxes), such as abrupt transitions of the polar bear's expression and the sudden appearance and disappearance of the yellow flower. Additionally, the long videos lack some key visual details (marked by the orange boxes), such as the missing "drum" and "wooden bowl", and the blurred "NYC Times Square".
  • Figure 2: Overview of our proposed method. CA and SA denote cross-attention and self-attention, respectively. During the denoising steps, the video hidden states $Z$ iteratively pass through temporal transformer layers where our LongDiff mechanism is applied. LongDiff comprises two key components: Position Mapping (PM) and Informative Frame Selection (IFS), corresponding to two modifications to temporal self-attention. First, we transform the original relative position matrix (the green matrix) via PM to alleviate the temporal position ambiguity issue. Additionally, a specially designed IFS mask restricts the temporal correlations of each query frame to both its neighbor frames and a set of detected key frames, to avoid the problem of information dilution.
  • Figure 3: Figure (a) shows the GROUP operation where $N=9$ and $G=3$. The query-axis and key-axis of the matrices represent positions of the query frames and key frames, respectively. Each matrix entry represents the relative position between the query and key frames. In the grouped relative position matrix, the 17 original relative positions (ranging from $-8$ to $8$) are grouped into 5 groups (from $-2$ to $2$). Figure (b) shows a simple case of the SHIFT operation on the first column of the shifted relative position matrix $\mathbf{G}^{(m)}$. The red box represents the "assignment record".
  • Figure 4: Illustration of the SHIFT operation. In each SHIFT operation, each entry in the upper triangle is shifted to the right by one position, with zeros added at the left. Meanwhile, each entry in the lower triangle is shifted downward by one position, with zeros added at the top.
  • Figure 5: Qualitative comparisons of longer video generation (128 frames). We illustrate inferior temporal consistency and the lack of visual details using red and orange boxes, respectively. More examples are in Supplementary.
  • ...and 5 more figures
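
The counts in the Figure 3 caption ($N=9$, $G=3$: 17 original relative positions grouped into 5, from $-2$ to $2$) can be checked with a small sketch. The grouping rule used here is an assumption, since this page does not spell it out: frame indices are bucketed into $G$ contiguous blocks and the block indices are subtracted. The sign convention (key index minus query index) and the function names are also assumptions.

```python
import math


def relative_positions(n):
    """n x n matrix; entry [q][k] is the offset k - q (sign convention assumed)."""
    return [[k - q for k in range(n)] for q in range(n)]


def grouped_relative_positions(n, g):
    """Assumed grouping rule, not confirmed by the source: bucket frame indices
    into g contiguous blocks of size ceil(n / g), then subtract block indices."""
    size = math.ceil(n / g)
    return [[k // size - q // size for k in range(n)] for q in range(n)]


def distinct(matrix):
    """Sorted list of the distinct values appearing in a matrix."""
    return sorted({v for row in matrix for v in row})


original = relative_positions(9)
grouped = grouped_relative_positions(9, 3)
print(len(distinct(original)), distinct(grouped))  # prints: 17 [-2, -1, 0, 1, 2]
```

Under this assumed rule the number of distinct grouped positions is $2G-1 = 5$, matching the caption. Other rules (for example, dividing the relative position itself) would give the same range for this example, so the sketch only illustrates the counting, not the exact mapping.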

Theorems & Definitions (7)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 1
  • proof
  • Theorem 4
  • proof