FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion

Yu Lu, Yi Yang

TL;DR

The paper tackles the difficulty of extending short-video diffusion models to long-form videos by identifying high-frequency distortion as a key bottleneck. It introduces FreeLong, which uses SpectralBlend Attention to fuse global low-frequency structure with local high-frequency details in the denoising process, all training-free. Building on this, FreeLong++ adds Multi-band SpectralFusion with multi-scale attention and SpecMix noise initialization to further stabilize long-range dynamics and preserve fine motion details. Across Wan-2.1 and LTX-Video, the approach yields substantial improvements in temporal consistency and visual fidelity for 4x and 8x longer videos and supports multi-prompt storytelling and long-range control without retraining.

Abstract

Recent advances in video generation models have enabled high-quality short video generation from text prompts. However, extending these models to longer videos remains a significant challenge, primarily due to degraded temporal consistency and visual fidelity. Our preliminary observations show that naively applying short-video generation models to longer sequences leads to noticeable quality degradation. Further analysis identifies a systematic trend where high-frequency components become increasingly distorted as video length grows, an issue we term high-frequency distortion. To address this, we propose FreeLong, a training-free framework designed to balance the frequency distribution of long video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows to preserve fine details. Building on this, FreeLong++ extends FreeLong's dual-branch design into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale. By arranging multiple window sizes from global to local, FreeLong++ enables multi-band frequency fusion from low to high frequencies, ensuring both semantic continuity and fine-grained motion dynamics across longer video sequences. Without any additional training, FreeLong++ can be plugged into existing video generation models (e.g., Wan2.1 and LTX-Video) to produce longer videos with substantially improved temporal consistency and visual fidelity. We demonstrate that our approach outperforms previous methods on longer video generation tasks (e.g., 4x and 8x of native length). It also supports coherent multi-prompt video generation with smooth scene transitions and enables controllable video generation using long depth or pose sequences.

Paper Structure

This paper contains 26 sections, 9 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Results of Short and Longer Videos. The first row of each case shows short videos generated using short video diffusion models (81 frames for Wan-2.1 and 121 frames for LTX-Video). Directly extending these models to longer videos, such as 4$\times$ the native length (324 frames and 484 frames, respectively), preserves temporal consistency but loses fine spatial-temporal details. In contrast, our proposed FreeLong and FreeLong++ adapt short video diffusion models to create consistent long videos with high fidelity.
  • Figure 2: Ratio between short-video and longer-video SNR in the high (0.25$\pi$-1.0$\pi$) and low (0.0$\pi$-0.25$\pi$) frequency bands. Our findings reveal that when a short video diffusion model is directly extended to generate longer videos, the SNR of high-frequency components in the space-time frequency domain degrades significantly as video length increases.
  • Figure 3: Attention Visualization. We visualize the attention by averaging across all layers and time steps of Wan2.1. The attention maps for 81-frame videos exhibit a diagonal-like pattern, indicating a high correlation with adjacent frames, which helps preserve high-frequency details and motion patterns when generating new frames. In contrast, attention maps for longer videos, such as 648 frames (8$\times$), are less structured, making it hard for the model to identify and attend to the relevant information across distant frames. This lack of structure distorts the high-frequency components of long videos, which degrades fine spatial-temporal details.
  • Figure 4: Fine-grained frequency analysis on longer video generation. (a) As video length increases, both the range and severity of frequency distortion grow substantially. (b) We define available frequency bands as those with a relative SNR above 0.9. As shown, the number of available bands drops significantly when the video length increases from 2$\times$ to 4$\times$, indicating that a fixed two-branch structure in FreeLong is insufficient for modeling motion dynamics in longer sequences. (c) High-frequency distortion correlates with attention window size: larger window sizes introduce more severe distortion in the high-frequency components.
  • Figure 5: Overview of FreeLong. FreeLong facilitates consistent and high-fidelity video generation using SpectralBlend Attention. SpectralBlend effectively blends low-frequency global video features with high-frequency local video features through a two-step process: local-global attention decoupling and spectral blending. Local video features are obtained by masking temporal attention to concentrate on fixed-length adjacent frames, while global temporal attention encompasses all frames. During spectral blending, 3D FFT projects features into the frequency domain, where high-frequency local components and low-frequency global components are merged. The resulting blended feature, transformed back to the time domain via IFFT, is then utilized in the subsequent block for refined video generation.
  • ...and 4 more figures
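
The two steps named in the Figure 5 caption (local-global attention decoupling, then spectral blending) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation. The function names, the symmetric window centred on each frame, the box-shaped low-pass mask, the single-channel (T, H, W) feature layout, and the 0.25$\pi$ cutoff (borrowed from the band split in Figure 2) are all assumptions made for the example.

```python
import numpy as np


def temporal_window_mask(num_frames: int, window: int) -> np.ndarray:
    """Boolean (T, T) mask: frame i may attend to frame j iff |i - j| <= window // 2.

    Hypothetical helper; a symmetric window centred on each frame is an assumption.
    """
    idx = np.arange(num_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2


def low_pass_mask(shape, cutoff: float = 0.25) -> np.ndarray:
    """Boolean mask over a 3D spectrum, True where every axis frequency is <= cutoff.

    `cutoff` is expressed in units of pi (1.0 is the Nyquist frequency).
    The box shape and the default 0.25 are assumptions for this sketch.
    """
    # fftfreq returns cycles per sample in [-0.5, 0.5); doubling gives units of pi.
    axes = [np.abs(np.fft.fftfreq(n)) * 2.0 for n in shape]
    ft, fh, fw = np.meshgrid(*axes, indexing="ij")
    return (ft <= cutoff) & (fh <= cutoff) & (fw <= cutoff)


def spectral_blend(global_feat: np.ndarray, local_feat: np.ndarray,
                   cutoff: float = 0.25) -> np.ndarray:
    """Keep the low band of `global_feat` and the high band of `local_feat`.

    Both inputs are real arrays of shape (T, H, W), standing in for the outputs
    of the global-attention and window-masked attention branches.
    """
    if global_feat.shape != local_feat.shape:
        raise ValueError("global and local features must have the same shape")
    low = low_pass_mask(global_feat.shape, cutoff)
    global_spec = np.fft.fftn(global_feat, axes=(0, 1, 2))
    local_spec = np.fft.fftn(local_feat, axes=(0, 1, 2))
    # Low frequencies come from the global branch, the rest from the local branch.
    blended_spec = np.where(low, global_spec, local_spec)
    # The mask is symmetric in frequency, so the inverse transform is real
    # up to floating-point error; drop the negligible imaginary part.
    return np.fft.ifftn(blended_spec, axes=(0, 1, 2)).real
```

In this sketch, `temporal_window_mask` would be applied to the temporal attention logits of the local branch, and `spectral_blend` would merge the two branch outputs before they are passed to the next block. Blending a feature map with itself returns it unchanged, which is a quick sanity check that the low-band and high-band masks partition the spectrum.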