Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
Haodong Li, Shaoteng Liu, Zhe Lin, Manmohan Chandraker
TL;DR
This work tackles long-horizon drift in autoregressive (AR) video diffusion, which stems from exposure bias between the limited training horizon $T_{train}$ and the open-ended testing horizon $T_{test}$. It analyzes AR cache maintenance and introduces Rolling Sink, a training-free method that keeps the cache within a fixed budget $K$ with $S$ sink blocks, augmented by Sliding Indices and Sliding Semantics to maintain drift-free rollouts. Built on Self Forcing, which is trained on 5 s clips, Rolling Sink enables ultra-long synthesis (5–30 minutes at 16 FPS) with stable identities, coherent structure, and smooth motion, outperforming SOTA baselines on long-horizon metrics. The approach offers a practical path toward open-ended AR video generation with preserved quality; extension to multi-shot scenarios remains a future direction.
Abstract
Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradation. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons seen during training and the open-ended horizons encountered during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To this end, we conduct a systematic analysis of AR cache maintenance. The resulting insights lead to Rolling Sink. Built on Self Forcing (trained on only 5 s clips), Rolling Sink effectively scales AR video synthesis to ultra-long durations (e.g., 5–30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/
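To make the notion of a cache held within a fixed budget $K$ with $S$ sink blocks more concrete, the following is a minimal sketch of a generic "sink plus sliding window" cache policy. It is not the Rolling Sink method itself: Sliding Indices and Sliding Semantics are not modeled, the class name `SinkWindowCache` and its parameters `budget` and `num_sink` are hypothetical, and integer block ids stand in for the per-block key/value tensors a real AR video diffusion model would store.

```python
from collections import deque
from typing import Deque, List


class SinkWindowCache:
    """Generic fixed-budget cache: pinned sink blocks plus a FIFO window.

    Illustrative assumption only, not the paper's implementation.
    """

    def __init__(self, budget: int, num_sink: int) -> None:
        if not 0 <= num_sink < budget:
            raise ValueError("need 0 <= num_sink < budget")
        self.budget = budget        # K: maximum number of cached blocks
        self.num_sink = num_sink    # S: earliest blocks that are never evicted
        self.sink: List[int] = []
        # The remaining K - S slots hold the most recent blocks.
        self.recent: Deque[int] = deque(maxlen=budget - num_sink)

    def append(self, block_id: int) -> None:
        """Add a newly generated block, evicting the oldest non-sink block if full."""
        if len(self.sink) < self.num_sink:
            self.sink.append(block_id)
        else:
            self.recent.append(block_id)  # deque drops its oldest entry when full

    def blocks(self) -> List[int]:
        """Return the block ids currently visible to the next generation step."""
        return self.sink + list(self.recent)


cache = SinkWindowCache(budget=6, num_sink=2)
for block_id in range(20):
    cache.append(block_id)
    # The cache never exceeds the budget, however long the rollout runs.
    assert len(cache.blocks()) <= cache.budget

print(cache.blocks())  # [0, 1, 16, 17, 18, 19]
```

In this sketch the memory cost stays constant regardless of rollout length, which is the property a fixed budget provides; how the cached content and its positional indices should be treated beyond the training horizon is the question the paper's analysis addresses.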
