Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion

Yang Yang, Tianyi Zhang, Wei Huang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, Peng-Tao Jiang

Abstract

Interactive long video generation requires prompt switching to introduce new subjects or events, while maintaining perceptual fidelity and coherent motion over extended horizons. Recent distilled streaming video diffusion models reuse a rolling KV cache for long-range generation, enabling prompt-switch interaction through re-cache at each switch. However, existing streaming methods still exhibit progressive quality degradation and weakened motion dynamics. We identify two failure modes specific to interactive streaming generation: (i) at each prompt switch, current cache maintenance cannot simultaneously retain KV-based semantic context and recent latent cues, resulting in weak boundary conditioning and reduced perceptual quality; and (ii) during distillation, unbounded time indexing induces a positional distribution shift from the pretrained backbone's bounded RoPE regime, weakening pretrained motion priors and long-horizon motion retention. To address these issues, we propose Anchor Forcing, a cache-centric framework with two designs. First, an anchor-guided re-cache mechanism stores KV states in anchor caches and warm-starts re-cache from these anchors at each prompt switch, reducing post-switch evidence loss and stabilizing perceptual quality. Second, a tri-region RoPE with region-specific reference origins, together with RoPE re-alignment distillation, reconciles unbounded streaming indices with the pretrained RoPE regime to better retain motion priors. Experiments on long videos show that our method improves perceptual quality and motion metrics over prior streaming baselines in interactive settings. Project page: https://github.com/vivoCameraResearch/Anchor-Forcing
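
The positional issue in failure mode (ii) can be illustrated with a toy index calculation. The sketch below is not the paper's tri-region RoPE, whose exact formulation is not given in this abstract. It only assumes, hypothetically, that the visible context splits into three regions (anchor frames, a rolling window, and the current chunk) and that each region counts from its own fixed origin. The function name `streaming_positions` and all of its parameters are illustrative assumptions.

```python
def streaming_positions(num_anchor, window, chunk, step):
    """Return (unbounded, bounded) temporal indices for frames visible at one step.

    Hypothetical layout (an assumption, not the paper's definition):
    anchor frames | rolling-window frames | current chunk.
    """
    start = step * chunk                       # absolute index of the first new frame
    recent_start = max(0, start - window)      # oldest frame still in the rolling window
    recent = list(range(recent_start, start))
    current = list(range(start, start + chunk))
    # Assumption: anchors are the earliest frames that have left the rolling window.
    anchors = list(range(min(num_anchor, recent_start)))

    # Absolute streaming indices grow without bound as `step` increases.
    unbounded = anchors + recent + current

    # Region-specific origins: each region counts from its own fixed offset,
    # so no index ever reaches num_anchor + window + chunk.
    bounded = (
        list(range(len(anchors)))
        + [num_anchor + i for i in range(len(recent))]
        + [num_anchor + window + i for i in range(len(current))]
    )
    return unbounded, bounded


unbounded, bounded = streaming_positions(num_anchor=2, window=4, chunk=2, step=100)
print(unbounded)  # [0, 1, 196, 197, 198, 199, 200, 201]
print(bounded)    # [0, 1, 2, 3, 4, 5, 6, 7]
```

Under these assumptions the bounded indices stay in a fixed range no matter how long the stream runs, which is the property a backbone pretrained on a bounded RoPE range would need.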

Paper Structure (16 sections, 8 equations, 5 figures, 6 tables)

This paper contains 16 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: We propose Anchor Forcing, which supports prompt switches that introduce new subjects and actions while preserving context, motion quality, and temporal coherence across clips. In contrast, prior methods degrade over time and often fail to realize newly introduced interactions, as highlighted by the red boxes. Red text denotes the interaction newly specified in each segment.
  • Figure 2: The Anchor Forcing pipeline. (a) Overview of Anchor Forcing in an interactive setting with two prompt switches. We denote the anchor memory for generating frame $t$ as $\mathcal{M}(t)$, and apply anchor-guided re-cache at $f_1$ and $f_2$ to update the local KV cache under the new prompt condition. (b) Prior re-cache [yang2025longlive]. It rebuilds the local cache solely from historical frame latents, which fails to retain prior KV evidence across prompt switches. (c) Anchor-guided re-cache at $f_2$. It augments re-cache with the anchor memory $\mathcal{M}(6)$ and refreshes the junction caches $x_5$.
  • Figure 3: Qualitative comparison on interactive long-video generation. Compared with baselines, Anchor Forcing achieves stronger prompt compliance, more coherent dynamic motion, and higher long-range visual quality. Red text marks the interactive content newly introduced in each segment.
  • Figure 4: Qualitative comparison on 30-second long-video generation. Compared with prior methods, our method yields higher-quality, more prompt-faithful videos, avoiding the background and content degradations highlighted in the yellow and red boxes.
  • Figure 5: Ablation on 60-second interactive generation. Tri-region RoPE improves motion dynamics, and anchor-guided re-cache further preserves perceptual quality and prompt compliance across segments. Red boxes mark degradation between adjacent clips, and red text denotes newly introduced interactions.
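
The contrast between panels (b) and (c) of Figure 2 can be sketched as a toy data structure. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `ToyStreamCache`, the `encode` callback (standing in for whatever maps a frame latent and a prompt to a KV entry), and the rule for choosing anchors are all hypothetical.

```python
from collections import deque


class ToyStreamCache:
    """Toy rolling KV cache with an optional anchor memory (illustrative only)."""

    def __init__(self, window, encode):
        self.window = window
        self.encode = encode                    # hypothetical: (latent, prompt) -> KV entry
        self.latents = deque(maxlen=window)     # recent frame latents
        self.local_kv = deque(maxlen=window)    # rolling local KV cache
        self.anchors = []                       # anchor memory of stored KV entries

    def append(self, latent, prompt, is_anchor=False):
        kv = self.encode(latent, prompt)
        self.latents.append(latent)
        self.local_kv.append(kv)
        if is_anchor:                           # assumption: caller decides what is an anchor
            self.anchors.append(kv)

    def recache_latents_only(self, new_prompt):
        """Baseline re-cache: rebuild the local cache from recent latents alone."""
        self.local_kv = deque(
            (self.encode(z, new_prompt) for z in self.latents), maxlen=self.window
        )
        return list(self.local_kv)

    def recache_with_anchors(self, new_prompt):
        """Anchor-style re-cache: keep stored anchor KV and refresh the recent entries."""
        refreshed = self.recache_latents_only(new_prompt)
        return list(self.anchors) + refreshed


# Toy encoder that just records which prompt each entry was computed under.
cache = ToyStreamCache(window=2, encode=lambda z, p: (z, p))
for frame in range(4):
    cache.append(frame, "A", is_anchor=(frame == 0))

print(cache.recache_latents_only("B"))   # [(2, 'B'), (3, 'B')]
print(cache.recache_with_anchors("B"))   # [(0, 'A'), (2, 'B'), (3, 'B')]
```

In this toy, the baseline context after a switch contains nothing computed before the rolling window, while the anchor variant still carries the stored entry from frame 0.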