Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation

Jia Li, Xiaomeng Fu, Xurui Peng, Weifeng Chen, Youwei Zheng, Tianyu Zhao, Jiexi Wang, Fangmin Chen, Xing Wang, Hayden Kwok-Hay So

TL;DR

FLEX (Frequency-aware Length EXtension) is a training-free inference-time framework that bridges the gap between short-term training and long-term inference. It pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale.

Abstract

Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. We identify that this failure primarily stems from the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling. To address these issues, we propose FLEX (Frequency-aware Length EXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors and Inference-only Attention Sink to anchor global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12x scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. Project page is available at https://ga-lee.github.io/FLEX_demo.

Paper Structure

This paper contains 46 sections, 3 theorems, 27 equations, 15 figures, and 7 tables.

Key Result

Proposition 3.1

The ANS construction in Eq. (eq:ans_recursion) ensures that:

(i) Marginal Preservation: Each frame maintains the standard Gaussian distribution $\mathbf{z}_u \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.

(ii) Toeplitz Covariance: The temporal correlation follows a symmetric Toeplitz structure, $\text{Cov}(\mathbf{z}_u, \mathbf{z}_v) = \rho^{|u-v|}\mathbf{I}$.
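
The two properties can be checked numerically. The referenced recursion is not reproduced on this page, so the sketch below assumes a first-order autoregressive form, $\mathbf{z}_0 = \boldsymbol{\epsilon}_0$ and $\mathbf{z}_u = \rho\,\mathbf{z}_{u-1} + \sqrt{1-\rho^2}\,\boldsymbol{\epsilon}_u$ with i.i.d. standard Gaussian $\boldsymbol{\epsilon}_u$. This is one standard construction that yields unit marginal variance and covariance $\rho^{|u-v|}\mathbf{I}$; it may differ from the paper's exact equation. The function name `sample_ans_noise` and all parameter values are hypothetical.

```python
import numpy as np


def sample_ans_noise(num_frames, frame_shape, rho, rng):
    """Draw temporally correlated noise under an ASSUMED AR(1) recursion.

    z_0 = eps_0
    z_u = rho * z_{u-1} + sqrt(1 - rho^2) * eps_u,  eps_u ~ N(0, I)

    The recursion is an assumption for illustration, not the paper's
    confirmed formula.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    eps = rng.standard_normal((num_frames, *frame_shape))
    z = np.empty_like(eps)
    z[0] = eps[0]
    scale = np.sqrt(1.0 - rho**2)  # keeps each frame at unit variance
    for u in range(1, num_frames):
        z[u] = rho * z[u - 1] + scale * eps[u]
    return z


rng = np.random.default_rng(0)
rho = -0.5  # antiphase regime (rho < 0); value chosen only for the demo
num_frames = 6
z = sample_ans_noise(num_frames, frame_shape=(200_000,), rho=rho, rng=rng)

# Empirical frame-by-frame covariance, averaged over the noise dimensions.
empirical = (z @ z.T) / z.shape[1]

# Target Toeplitz matrix with entries rho^{|u - v|}.
lags = np.abs(np.subtract.outer(np.arange(num_frames), np.arange(num_frames)))
expected = rho ** lags

max_abs_error = np.max(np.abs(empirical - expected))
print("largest deviation from the Toeplitz target:", max_abs_error)
```

The diagonal of `empirical` estimates the per-frame variance (property i), and the off-diagonals estimate the lag-dependent correlation (property ii). With a negative `rho`, adjacent frames are anticorrelated, which is the regime the Figure 2 caption calls antiphase.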

Figures (15)

  • Figure 1: Frequency-aware analysis of temporal RoPE. (Left) Visualization of rotary periodicity $\sin(\Delta n \theta_m)$ along the temporal dimension. (Right) Training exposure $r_m$, defined as the number of completed cycles given a training horizon of $L_{train}=21$.
  • Figure 2: Spectral analysis of Antiphase Noise Sampling. (Left) Power Spectral Density $S_{\rho}(\omega)$. (Right) Motion Energy Density $|H(\omega)|^2 S_{\rho}(\omega)$. The antiphase regime ($\rho < 0$) shifts power toward the high-frequency passband, boosting motion energy relative to the i.i.d. baseline ($\rho=0$).
  • Figure 3: Qualitative comparison of 30-second video generation. Our method yields higher-fidelity video generation with reduced artifacts and improved temporal consistency, maintaining consistent appearance attributes over time.
  • Figure 4: Qualitative comparison of 60-second video generation. We compare our method against four baselines. While prior methods suffer from identity drift or scene collapse over long durations, our model maintains high subject consistency and visual quality. The right panels (Character A-E) highlight our model's ability to preserve fine-grained character identities throughout the entire 60s sequence.
  • Figure 5: Impact of covariance coefficient $\rho$ on VBench metrics.
  • ...and 10 more figures
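
The training exposure $r_m$ from the Figure 1 caption can be illustrated with a short calculation. The sketch below assumes the standard RoPE frequency schedule $\theta_m = b^{-2m/d}$ and takes $r_m = L_{train}\,\theta_m / (2\pi)$, the (fractional) number of cycles a component completes over the training horizon. Only $L_{train}=21$ comes from the caption; the base $b = 10000$, the temporal dimension $d = 44$, the threshold of one full cycle, and the function names are hypothetical, and the paper may round or define $r_m$ differently.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE angular frequencies theta_m = base^(-2m/dim)."""
    return [base ** (-2.0 * m / dim) for m in range(dim // 2)]


def training_exposure(thetas, train_len):
    """Cycles each component completes across train_len positions.

    ASSUMED definition: r_m = train_len * theta_m / (2 * pi).
    """
    return [train_len * theta / (2.0 * math.pi) for theta in thetas]


L_TRAIN = 21       # training horizon stated in the Figure 1 caption
TEMPORAL_DIM = 44  # hypothetical size of the temporal RoPE sub-dimension

thetas = rope_frequencies(TEMPORAL_DIM)
exposure = training_exposure(thetas, L_TRAIN)

# Components that never finish one full cycle during training; under the
# assumptions above these are the low-frequency, under-trained ones.
under_trained = [m for m, r in enumerate(exposure) if r < 1.0]

for m, r in enumerate(exposure):
    tag = "under-trained" if r < 1.0 else "well-covered"
    print(f"m={m:2d}  theta={thetas[m]:.5f}  r_m={r:.3f}  {tag}")
```

Components with $r_m < 1$ see only part of their period during training, which is the intuition behind interpolating the low-frequency components and extrapolating the high-frequency ones, as the abstract describes.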

Theorems & Definitions (4)

  • Proposition 3.1: Distributional Invariance and Structure
  • Proposition 3.2: Energy Monotonicity
  • Proposition 2.1
  • proof