Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag

TL;DR

∞-RoPE is a training-free, inference-time framework that upgrades short-horizon autoregressive video diffusion models to infinite-horizon, controllable generation. It combines Block-Relativistic RoPE to remove fixed temporal limits, KV Flush for instant action changes with constant memory, and RoPE Cut for cinematic scene transitions, all while preserving identity and scene coherence. In qualitative and quantitative evaluations, ∞-RoPE achieves state-of-the-art long-horizon performance on VBench, demonstrates precise action control, and enables multi-cut cinematic sequences within a single rollout. The approach offers a practical path to scalable, temporally robust, and user-controllable long-form video synthesis without retraining or additional data.

Abstract

Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
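
As a rough illustration of why rotating cached blocks backward preserves relative temporal geometry, the sketch below relies on the standard property of rotary embeddings that a query-key score depends only on the difference between the two positions. It is a minimal 1D toy in pure Python, not the authors' implementation: the helper names (`rope_rotate`, `attention_score`, `window_positions`), the 0-based indexing, and the choice to pin the newest frame at `f_limit - 1` are assumptions made here for illustration.

```python
import cmath


def rope_rotate(vec, pos, base=10000.0):
    """Apply 1D RoPE to a vector stored as complex pairs (toy version)."""
    n = len(vec)
    # Pair j is rotated by the angle pos * base**(-j / n).
    return [v * cmath.exp(1j * pos * base ** (-j / n)) for j, v in enumerate(vec)]


def attention_score(q, k, q_pos, k_pos):
    """Unnormalized query-key score after rotating both by their positions."""
    q_rot = rope_rotate(q, q_pos)
    k_rot = rope_rotate(k, k_pos)
    # Real part of a * conj(b) is the dot product of the underlying 2D pairs.
    return sum((a * b.conjugate()).real for a, b in zip(q_rot, k_rot))


def window_positions(num_cached, f_limit):
    """Temporal indices for cached frames, oldest first.

    Assumption (hypothetical convention): the newest frame sits at
    f_limit - 1 and older frames count backward from it.
    """
    if num_cached > f_limit:
        raise ValueError("cache holds more frames than the trained horizon")
    return list(range(f_limit - num_cached, f_limit))


# Arbitrary toy query and key, two complex pairs each.
q = [complex(0.3, -1.2), complex(0.7, 0.1)]
k = [complex(-0.5, 0.4), complex(1.1, 0.9)]

# Same relative offset (3 frames), different absolute positions.
beyond_horizon = attention_score(q, k, 25, 22)
rebased = attention_score(q, k, 20, 17)
print(abs(beyond_horizon - rebased) < 1e-9)
```

In this toy, a query/key pair at positions 25 and 22 scores the same as the pair shifted to 20 and 17, which is why indices can be re-based into the trained range without changing the relative structure that attention sees.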

Paper Structure

This paper contains 25 sections, 11 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: $\infty$-RoPE demonstrates three core capabilities: infinite-length video generation enabled by Block-Relativistic RoPE, fine-grained action control through KV Flush, and cinematic multi-cut scene composition via RoPE Cut.
  • Figure 2: Motivation. Thirty-second video generation with Self-Forcing, with and without our method. (Top) Self-Forcing alone cannot sustain dynamic long-form generation. (Bottom) When augmented with Block-Relativistic RoPE, a Self-Forcing model trained only on five-second videos produces highly dynamic, high-quality long-form sequences.
  • Figure 3: Block-Relativistic RoPE. (a) Fixed cache size. As new latent blocks are generated, their temporal RoPE coordinates are rotated relative to the teacher’s maximum horizon $f_{\text{limit}}$, while earlier latents are rotated backward to preserve their relative temporal geometry within the fixed cache window. (b) Unbounded cache size. When the KV cache grows beyond $f_{\text{limit}}$, earlier latents undergo semanticization: temporally distant tokens collapse into abstract semantic memory, while recent high-SNR tokens retain precise temporal geometry. See Sec. \ref{subsec:block_relativistic_rope} for details.
  • Figure 4: KV Flush. KV Flush resets the KV cache to only two tokens, the global sink and the last latent frame, so that a new prompt takes effect immediately without carrying over old semantics. Compared to no-cache (abrupt changes), full-cache (semantic lag), and KV re-cache (high latency), KV Flush achieves instant, clean action responsiveness with smooth temporal continuity, as shown in the prompt sequence: standing → jumping → sitting → singing.
  • Figure 5: RoPE Cut. RoPE Cut enables a discontinuous jump along the temporal RoPE axis. In the first rollout (second row), the active latent block $B_{6}=\{4,5,6\}$ is reassigned to a new RoPE-local frame, becoming $B_{4\rightarrow 21}=\{4,20,21\}$: the token "4" is kept as the local anchor while the next two tokens are reassigned to the high-SNR positions $20$ and $21$ for continued denoising. In the subsequent rollout (third row), the block $\{4,20,21\}$ is treated as past context after the cut, and generation proceeds again from the original temporal location with a fresh block $B_{6}=\{4,5,6\}$ inside the fixed cache horizon.
  • ...and 8 more figures
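
The cache and index manipulations described in the captions for Figures 4 and 5 can be sketched on plain Python lists. This is a simplified illustration under stated assumptions rather than the paper's code: `kv_flush` and `rope_cut` are hypothetical helper names, the global sink is assumed to be the first cache entry, and a real KV cache would hold per-layer key/value tensors instead of labels.

```python
def kv_flush(cache):
    """Keep only the global sink (assumed first) and the last latent frame."""
    if len(cache) < 2:
        return list(cache)
    return [cache[0], cache[-1]]


def rope_cut(block, jump_to):
    """Reassign temporal indices of a block to create a discontinuity.

    The first index is kept as the local anchor; the remaining indices are
    moved so that the last one lands on `jump_to` (hypothetical parameter).
    """
    anchor, rest = block[0], block[1:]
    start = jump_to - len(rest) + 1
    return [anchor] + list(range(start, jump_to + 1))


# Cache entries are labels standing in for per-frame key/value tensors.
cache = ["sink", "frame_1", "frame_2", "frame_3"]
flushed = kv_flush(cache)  # ["sink", "frame_3"]

# Figure 5 example: block {4, 5, 6} becomes {4, 20, 21}.
cut_block = rope_cut([4, 5, 6], 21)  # [4, 20, 21]
```

Because the flushed cache always holds two entries, its memory footprint stays constant regardless of how many prompts have been issued, and the cut leaves the anchor index untouched so the next rollout can resume from the original temporal location.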