Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction

Jiahao Tian, Chenxi Song, Wei Cheng, Chi Zhang

Abstract

Generating long videos with pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge: directly applying these models to long-video inference often causes a notable degradation in visual quality. This paper identifies two out-of-distribution (O.O.D) problems as the primary cause: frame-level relative position O.O.D and context-length O.O.D. To address them, we propose FreeLOC, a training-free, layer-adaptive framework built on two core techniques. Video-based Relative Position Re-encoding (VRPR) targets frame-level relative position O.O.D with a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model's pre-trained distribution. Tiered Sparse Attention (TSA) targets context-length O.O.D, preserving both local detail and long-range dependencies by structuring attention density across temporal scales. Crucially, a layer-adaptive probing mechanism identifies each transformer layer's sensitivity to these O.O.D issues, allowing our methods to be applied selectively and efficiently. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at https://github.com/Westlake-AGI-Lab/FreeLOC.
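The abstract characterizes VRPR only at a high level, so the snippet below is a minimal, hypothetical sketch of the general idea rather than the paper's exact algorithm: temporal positions that fall outside the pre-trained window are re-encoded at two granularities (a coarse block index plus a fine in-block offset) so that every position fed to the attention layers lands back inside the trained range. The function name `hierarchical_position_remap` and the parameters `trained_len` and `group` are illustrative assumptions, not values from the paper.

```python
import torch

def hierarchical_position_remap(num_frames: int, trained_len: int, group: int) -> torch.Tensor:
    """Remap the temporal positions of a long video so relative offsets stay
    within the range seen during pre-training (a VRPR-style re-encoding).

    Hypothetical two-level scheme: frames are grouped into blocks of `group`;
    the coarse block index is squeezed into the trained window while the fine
    in-block offset is kept exact, preserving local ordering.
    """
    idx = torch.arange(num_frames)
    block = idx // group                          # coarse position (which block)
    offset = idx % group                          # fine position inside the block
    num_blocks = (num_frames + group - 1) // group
    # Scale coarse indices so coarse + fine never exceeds the trained window.
    coarse = block.float() / max(num_blocks - 1, 1) * (trained_len - group)
    return coarse + offset.float()                # fractional positions in [0, trained_len)

# Example: 256 frames remapped into positions a 64-frame model has already seen.
pos = hierarchical_position_remap(num_frames=256, trained_len=64, group=8)
assert pos.min() >= 0 and pos.max() < 64
```

In a RoPE-based DiT, such fractional positions would replace the raw frame indices before the rotary embedding is computed; the scheme trades absolute temporal resolution at the coarse level for relative offsets the model has actually been trained on.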

Figures (14)

  • Figure 1: Overview of FreeLOC. FreeLOC employs an offline, layer-wise probing procedure to identify DiT layers sensitive to two O.O.D sources: context-length O.O.D and frame-level relative position O.O.D. For layers sensitive only to frame-level relative position, we apply the VRPR strategy to hierarchically remap out-of-range positions back into the pre-trained domain. For layers sensitive to context length, we utilize TSA combined with VRPR to balance local detail and global coherence. Zoom in for a better view. (A minimal sketch of this probing-and-dispatch logic follows the figure list.)
  • Figure 1: Layer-wise sensitivity to context-length O.O.D measured via attention entropy differences.
  • Figure 2: Sensitivity analysis for each layer during frame-level relative position O.O.D probing. (a) Vision Reward for each probing layer. A lower score indicates that the layer is more sensitive to frame-level relative position O.O.D. (b) Attention Logits Difference (ALD) for each probing layer. A higher value indicates a significant change in the behavior of the attention mechanism, signifying high sensitivity to positional O.O.D.
  • Figure 2: Impact of $W_1$, $W_2$, $G_1$ and $G_2$ on Subject Consistency and Imaging Quality for VRPR.
  • Figure 3: Layer-wise sensitivity to frame-level relative position O.O.D. The figure compares the original video with probing outputs from Layer 18 (low sensitivity) and Layer 28 (high sensitivity). Stronger sensitivity leads to noticeable distortions when relative-position shifts are applied.
  • ...and 9 more figures
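The Figure 1 captions above describe the offline probing step: each DiT layer's sensitivity to context-length O.O.D is measured via attention entropy differences, and sensitive layers are routed through TSA combined with VRPR while the rest receive VRPR alone. Below is one plausible reading of that dispatch rule as a sketch; the entropy measure is standard, but the threshold `tau` and the routing logic are assumptions rather than the paper's published procedure.

```python
import torch

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy over attention rows.

    `attn` has shape (heads, queries, keys) and is already softmax-normalized
    along the key dimension; diffuse attention yields high entropy."""
    return -(attn * (attn + 1e-12).log()).sum(dim=-1).mean()

def probe_layer(short_attn: torch.Tensor, long_attn: torch.Tensor, tau: float = 0.5) -> str:
    """Hypothetical dispatch: a large entropy gap between short-clip and
    long-video attention marks the layer as context-length sensitive."""
    gap = (attention_entropy(long_attn) - attention_entropy(short_attn)).abs()
    return "TSA+VRPR" if gap > tau else "VRPR"

# Toy probe: sharp short-context attention vs. diffuse long-context attention.
short = torch.softmax(torch.randn(8, 16, 16) * 4.0, dim=-1)   # peaky rows
long_ = torch.softmax(torch.randn(8, 64, 64) * 0.5, dim=-1)   # near-uniform rows
print(probe_layer(short, long_))  # expected: "TSA+VRPR" for this toy gap
```

Because the probing is offline, the per-layer decision can be computed once per model and cached, which is what keeps the selective application of VRPR and TSA cheap at inference time.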