Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

Haodong Li, Shaoteng Liu, Zhe Lin, Manmohan Chandraker

TL;DR

This work tackles long-horizon drift in autoregressive (AR) video diffusion, which is caused by exposure bias between the limited training horizon $T_{train}$ and the open-ended testing horizon $T_{test}$. It analyzes AR cache maintenance and introduces Rolling Sink, a training-free method that keeps the cache within a fixed budget of $K$ blocks, $S$ of which are sink blocks, and augments it with Sliding Indices and Sliding Semantics to maintain drift-free rollouts. Built on Self Forcing, which is trained on 5 s clips, Rolling Sink enables ultra-long synthesis (5–30 minutes at 16 FPS) with stable identities, coherent structure, and smooth motion, and it outperforms SOTA baselines on long-horizon metrics. The approach offers a practical path toward open-ended AR video generation with preserved quality; extension to multi-shot scenarios remains a future direction.

Abstract

Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradation. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons seen during training and the open-ended horizons encountered during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To this end, we conduct a systematic analysis of AR cache maintenance, and the resulting insights lead to Rolling Sink. Built on Self Forcing (trained on only 5 s clips), Rolling Sink effectively scales AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/
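
To make the size of the train-test gap concrete, the following minimal sketch converts the durations quoted above into nominal frame counts and horizon ratios. It assumes only the stated figures (5 s training clips, 16 FPS, 5 to 30 minute rollouts); the helper names `frames` and `horizon_ratio` are illustrative, and the model's actual frame or latent counts may differ from these nominal values.

```python
FPS = 16            # frame rate quoted in the abstract
TRAIN_SECONDS = 5   # training clip duration quoted in the abstract


def frames(seconds: int, fps: int = FPS) -> int:
    """Nominal number of frames in a clip of the given duration."""
    return seconds * fps


def horizon_ratio(test_minutes: int, train_seconds: int = TRAIN_SECONDS) -> float:
    """How many training horizons fit into one test-time rollout."""
    return test_minutes * 60 / train_seconds


if __name__ == "__main__":
    print("training clip:", frames(TRAIN_SECONDS), "frames")
    for minutes in (5, 30):
        print(f"{minutes}-minute rollout:",
              frames(minutes * 60), "frames,",
              horizon_ratio(minutes), "x the training horizon")
```

Under these assumptions, the quoted rollouts run from 60 to 360 times longer than anything seen in training.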

Paper Structure (21 sections, 12 equations, 29 figures, 11 tables)

Figures (29)

  • Figure 1: Rolling Sink unlocks open-ended AR video generation. Despite a 5 s training duration, Rolling Sink effectively scales AR video synthesis to minutes-long durations at test time, e.g., 5 minutes and 30 minutes (please see Fig. \ref{fig:ultra_long} and \ref{fig:ultra_long2} in our Supp$^{\ref{fn:supp}}$).
  • Figure 2: Bridging the gap between limited-horizon training and open-ended testing. Self Forcing \cite{huang2025self} studies the train-test gap when testing within the training window (i.e., 5 s at 16 FPS), while we extend the focus to the train-test gap that emerges when testing beyond this training window.
  • Figure 3: Overview of our analysis and the proposed Rolling Sink. ⓐ The caching mechanism of Self Forcing \cite{huang2025self}: the total cache capacity $K$ is strictly bounded for streaming efficiency. ⓑ We first apply Attention Sink (i.e., pinning the first $S$ blocks as sink blocks whose time indices and semantics are both static), and analyze the effect of different sink ratios ($\frac{S}{K}$). ⓒ Sliding Indices: Treating the time indices as a global axis $i\in[0,\infty)$, at each AR step $i$ we shift the sink blocks' time indices as a fixed-length (i.e., $S$) sliding window along this axis. ⓓ Sliding Semantics: Ideally, the sink blocks' semantic content should also slide along a drift-free, global video manifold that lasts endlessly. Since finite-length training cannot naturally realize this, we approximate true semantic sliding by rolling the sink content (i.e., at each AR step, we update the sink blocks' semantic content with a rolling segment from the within-duration history). Finally, we adopt ⓓ and name it Rolling Sink. For clarity, here we set $K=3$ and $S=2$. Please see Sec. \ref{sec:sys_ana} for more technical details.
  • Figure 4: Evaluation results during the systematic analysis, on both 1-minute (left) and 5-minute (right) AR video synthesis. The video quality score is the averaged score across all dimensions tested in VBench-Long \cite{huang2023vbench, huang2025vbench++, zheng2025vbench2}. As illustrated, the video quality is consistently improved during our systematic analysis, and the derived Rolling Sink yields the best performance (particularly when $\frac{S}{K}=83\%$). Please see Supp's Sec. \ref{sec:more_ana_eval} for the specific numerical results of all dimensions.
  • Figure 5: Visual comparisons across various sink sizes. Larger sink sizes stabilize colors, but noticeable AR drift (e.g., frame flicker) still persists. Here we set $t=60\text{s}$.
  • ...and 24 more figures
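
The three cache policies described in the Figure 3 caption can be sketched as pure bookkeeping over block indices. This is a hypothetical illustration, not the authors' implementation. Each cache entry is a pair `(content_block, time_index)`. The sketch assumes that the cache holds only blocks generated before the current step `i`, that the sink window's indices sit directly before the local window, and that "rolling" means a modular wrap over the first `T` within-duration blocks. None of these details, nor the function names, are specified in the captions above.

```python
from typing import List, Tuple

# (content_block, time_index): which generated block supplies the cached
# content, and which position on the global time axis it is labelled with.
Entry = Tuple[int, int]


def _local(i: int, K: int, S: int) -> List[Entry]:
    """Most recent non-sink blocks before step i (at most K - S of them)."""
    return [(b, b) for b in range(max(S, i - (K - S)), i)]


def attention_sink(i: int, K: int, S: int) -> List[Entry]:
    """Pin the first S blocks: static content and static time indices."""
    sink = [(s, s) for s in range(min(S, i))]
    return sink + _local(i, K, S)


def sliding_indices(i: int, K: int, S: int) -> List[Entry]:
    """Static sink content, but time indices slide along the global axis."""
    shift = max(0, i - K)  # assumption: sink window ends right before the local window
    sink = [(s, s + shift) for s in range(min(S, i))]
    return sink + _local(i, K, S)


def rolling_sink(i: int, K: int, S: int, T: int) -> List[Entry]:
    """Sliding indices plus sink content rolled over the first T blocks."""
    shift = max(0, i - K)
    # Assumption: the rolling segment wraps modulo T (the within-duration history).
    sink = [((s + shift) % T, s + shift) for s in range(min(S, i))]
    return sink + _local(i, K, S)


if __name__ == "__main__":
    K, S, T = 3, 2, 7  # K and S follow the caption; T is an illustrative value
    for step in (2, 10, 12):
        print(step, attention_sink(step, K, S),
              sliding_indices(step, K, S),
              rolling_sink(step, K, S, T))
```

In all three variants the cache never exceeds `K` entries; the policies differ only in which content the `S` sink slots hold and which time indices they are labelled with.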