Pathwise Test-Time Correction for Autoregressive Long Video Generation

Xunzhi Xiang, Zixuan Duan, Guiyu Zhang, Haiyu Zhang, Zhe Gao, Junta Wu, Shaofeng Zhang, Tengfei Wang, Qi Fan, Chunchao Guo

TL;DR

This work addresses error accumulation in long-horizon autoregressive video generation with distilled few-step diffusion samplers. It introduces Test-Time Correction (TTC), a training-free, path-aware intervention that anchors intermediate states to the initial frame and re-noises the corrected prediction on the sampling path, enabling stable 30-second video generation without retraining the model. In ablations and comparisons against baselines and training-based methods, TTC substantially reduces temporal drift and improves temporal coherence at negligible overhead. The approach is validated across multiple distilled architectures, offering a practical route to reliable real-time long-video synthesis.

Abstract

Distilled autoregressive diffusion models enable real-time short video synthesis but suffer from severe error accumulation during long-sequence generation. While existing Test-Time Optimization (TTO) methods prove effective for images or short clips, we find that they fail to mitigate drift in extended sequences because of unstable reward landscapes and the hypersensitivity of distilled parameters. To overcome these limitations, we introduce Test-Time Correction (TTC), a training-free alternative. Specifically, TTC uses the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory. Extensive experiments demonstrate that our method integrates with various distilled models, extending generation lengths with negligible overhead while matching the quality of resource-intensive training-based methods on 30-second benchmarks.

Paper Structure

This paper contains 15 sections, 18 equations, 13 figures, 7 tables, 1 algorithm.

Figures (13)

  • Figure 1: 30-second video generation examples. Our method reduces error accumulation in CausVid and Self-Forcing, enabling longer and more stable videos with improved visual consistency. All samples are generated with the same random seed for fair comparison.
  • Figure 2: Comparison of sampling strategies. The Original Path suffers from error accumulation, while the Sink-based Path collapses into a Sink Point (dynamic collapse). In contrast, our TTC strategy avoids both failures by employing reference-conditioned denoising and explicit Re-noising, steering the trajectory away from the sink and preserving the target distribution.
  • Figure 3: Variants of autoregressive video generation. Discrete AR uses single-step deterministic prediction, multi-step diffusion follows a deterministic ODE trajectory, while few-step distilled diffusion performs stochastic sampling with intermediate noise injection.
  • Figure 4: Comparison of two toy test-time optimization variants based on LoRA fine-tuning.
  • Figure 5: Overall pipeline of our method. A sparse set of correction steps is inserted into the stochastic sampling path until the global structure stabilizes. At selected steps, TTC performs reference-conditioned denoising using the initial frame to obtain a corrected prediction, which is then re-noised to the current timestep to remain consistent with the expected noise distribution. This on-path, training-free correction suppresses long-term error accumulation and stabilizes long-horizon generation.
  • ...and 8 more figures
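
The correction step described in the Figure 5 caption (reference-conditioned denoising followed by re-noising to the current timestep) can be illustrated with a small toy sketch. This is not the authors' implementation: the linear noise schedule in `renoise`, the two stand-in denoisers, and the `correct_at` parameter are all hypothetical choices made only to show the control flow of a few-step stochastic sampler with a sparse set of on-path corrections.

```python
import numpy as np


def renoise(x0_pred, t, rng):
    """Re-noise a clean prediction to noise level t.

    Assumption: a linear interpolation schedule, x_t = (1 - t) * x0 + t * eps.
    """
    eps = rng.standard_normal(x0_pred.shape)
    return (1.0 - t) * x0_pred + t * eps


def toy_denoiser(x_t, t, context):
    """Hypothetical stand-in for a distilled model's clean-frame prediction."""
    # Pulls the noisy state toward the (possibly drifted) context frame.
    return 0.5 * x_t + 0.5 * context


def toy_ref_denoiser(x_t, t, context, reference):
    """Hypothetical reference-conditioned prediction anchored to the first frame."""
    return 0.5 * toy_denoiser(x_t, t, context) + 0.5 * reference


def sample_frame(context, reference, timesteps, correct_at, rng):
    """Few-step stochastic sampling with optional on-path correction.

    timesteps  : decreasing noise levels, e.g. [1.0, 0.75, 0.5, 0.25]
    correct_at : set of step indices at which a correction is inserted
    """
    x_t = rng.standard_normal(context.shape)  # start from pure noise
    for i, t in enumerate(timesteps):
        if i in correct_at:
            # Correction: predict with the reference frame as anchor, then
            # re-noise back to the *current* level t so the state matches
            # the noise level the sampler expects at this step.
            x0_ref = toy_ref_denoiser(x_t, t, context, reference)
            x_t = renoise(x0_ref, t, rng)
        # Regular few-step update: predict a clean frame, then inject fresh
        # noise at the next level (level 0 returns the prediction itself).
        x0_pred = toy_denoiser(x_t, t, context)
        t_next = timesteps[i + 1] if i + 1 < len(timesteps) else 0.0
        x_t = renoise(x0_pred, t_next, rng)
    return x_t


# Example call: correct only at an early step (index 1).
frame = sample_frame(
    context=np.full((4, 4), 2.0),
    reference=np.zeros((4, 4)),
    timesteps=[1.0, 0.75, 0.5, 0.25],
    correct_at={1},
    rng=np.random.default_rng(0),
)
```

In this sketch the correction never leaves the sampling path: the corrected prediction is returned to the same noise level before the ordinary update runs, which mirrors the caption's point that re-noising keeps the state consistent with the expected noise distribution.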