MemRoPE: Training-Free Infinite Video Generation via Evolving Memory Tokens

Youngrae Kim, Qixin Hu, C. -C. Jay Kuo, Peter A. Beerel

Abstract

Autoregressive diffusion enables real-time frame streaming, yet existing sliding-window caches discard past context, causing fidelity degradation, identity drift, and motion stagnation over long horizons. Current approaches preserve a fixed set of early tokens as attention sinks, but this static anchor cannot reflect the evolving content of a growing video. We introduce MemRoPE, a training-free framework with two co-designed components. Memory Tokens continuously compress all past keys into dual long-term and short-term streams via exponential moving averages, maintaining both global identity and recent dynamics within a fixed-size cache. Online RoPE Indexing caches unrotated keys and applies positional embeddings dynamically at attention time, ensuring the aggregation is free of conflicting positional phases. These two mechanisms are mutually enabling: positional decoupling makes temporal aggregation well-defined, while aggregation makes fixed-size caching viable for unbounded generation. Extensive experiments validate that MemRoPE outperforms existing methods in temporal coherence, visual fidelity, and subject consistency across minute- to hour-scale generation.
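
The claim that aggregation must be free of conflicting positional phases can be checked numerically. The toy sketch below is not the authors' implementation. It models each RoPE channel pair as one complex number, and the helper names (`rope`, `mean`, `norm`), the base of 10000, and the four-channel key are assumptions chosen purely for illustration. It averages the same content key after rotation at 64 different positions, then compares that with averaging the unrotated key and rotating once afterwards.

```python
import cmath

def rope(key, pos, base=10000.0):
    """Rotate each complex channel of `key` by an angle proportional to `pos`.

    Assumption: one complex number stands in for one (even, odd) channel pair.
    """
    d = len(key)
    return [k * cmath.exp(1j * pos * base ** (-i / d)) for i, k in enumerate(key)]

def mean(vectors):
    """Channel-wise average of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(channel) / n for channel in zip(*vectors)]

def norm(v):
    """Euclidean norm of a complex vector."""
    return sum(abs(x) ** 2 for x in v) ** 0.5

# Hypothetical key: identical content observed at 64 different positions.
content = [1 + 0j, 0.5 + 0.5j, -1j, 0.3 + 0j]
positions = range(64)

# Aggregating keys that were already rotated mixes incompatible phases.
avg_of_rotated = mean([rope(content, p) for p in positions])

# Aggregating unrotated keys, then rotating once at attention time.
avg_unrotated = mean([content for _ in positions])
rotated_after = rope(avg_unrotated, 5)

print(norm(content), norm(avg_of_rotated), norm(rotated_after))
```

In this toy setting the average of pre-rotated keys loses norm because the high-frequency channels largely cancel, whereas the unrotated average keeps the full norm after a single late rotation.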

Paper Structure (40 sections, 7 equations, 13 figures, 8 tables, 1 algorithm)

This paper contains 40 sections, 7 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Training-free unbounded video generation. Our MemRoPE requires no additional training and enables unlimited generation with a fixed-size KV cache. We demonstrate a continuous one-hour generation process that perfectly preserves both subject identity and visual fidelity throughout.
  • Figure 2: KV cache structures for long video generation. (a) FIFO eviction [selfforcing, causvid, longlive, infinityrope] maintains an initial sink frame while discarding the oldest remaining frames when the cache is full, losing distant context. (b) Deep Forcing [deepforcing] dedicates over half the cache to static sink tokens and selects a small number of compressed tokens via attention-based importance scoring, which often causes temporal instability in the generated sequence (see Figure 3). (c) MemRoPE preserves a sink frame and manages distinct long- and short-term memories (see the section on memory tokens). By storing all keys without RoPE, it prevents positional interference from corrupting the stored features, thereby enabling stable memory aggregation (see the section on online RoPE indexing).
  • Figure 3: Failure mode of Participative Compression. (a) Participative Compression (PC), proposed in Deep Forcing [deepforcing], rapidly converges to retaining the same long-persisted tokens in the compressed frames, discarding most newly arriving tokens. (b) The few newly admitted tokens carry high attention scores, so each rare cache update exerts a disproportionately strong influence on generation. (c, d) This causes frame-to-frame instability: drops in consecutive-frame SSIM and spikes in LPIPS indicate abrupt visual shifts whenever the cache content changes. Our EMA memory evolves continuously, maintaining smooth transitions.
  • Figure 4: Overview of MemRoPE. (a) At each autoregressive step, the three-tier KV cache (sink, memory, and recent tokens) is concatenated with the current noisy chunk and fed into the DiT. When the local window is full, the oldest frames are absorbed into the long- and short-term memory tokens via dual EMA updates. (b) All cached keys are stored without RoPE, enabling temporal aggregation for memory tokens. At attention time, contiguous block-relative indices are assigned to the full sequence, and RoPE is applied on the fly, ensuring that indices never exceed the training range.
  • Figure 5: Qualitative comparison on 2-minute video generation. MemRoPE maintains subject identity and background consistency throughout, whereas baselines exhibit progressive degradation, including structural collapse and color corruption.
  • ...and 8 more figures
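
The cache behaviour described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the class name `ThreeTierCache`, the decay values, the update rule `m = decay * m + (1 - decay) * k`, the choice to initialise each memory from the first evicted frame, and the use of one key vector per frame are all hypothetical. Keys are stored unrotated, and `indexed_keys` returns the contiguous block-relative index that a RoPE rotation would use at attention time.

```python
from collections import deque

class ThreeTierCache:
    """Toy sink / memory / recent cache holding unrotated keys (illustrative only)."""

    def __init__(self, window=3, decay_long=0.99, decay_short=0.5):
        self.sink = None           # first frame, kept as a static anchor
        self.long_mem = None       # slow EMA: long-term stream
        self.short_mem = None      # fast EMA: short-term stream
        self.recent = deque()      # local window of the most recent frames
        self.window = window
        self.decay_long = decay_long
        self.decay_short = decay_short

    @staticmethod
    def _ema(mem, key, decay):
        # Assumed rule: the first evicted frame initialises the memory.
        if mem is None:
            return list(key)
        return [decay * m + (1.0 - decay) * k for m, k in zip(mem, key)]

    def push(self, key):
        """Add one unrotated frame key, absorbing the oldest on overflow."""
        if self.sink is None:
            self.sink = list(key)
            return
        self.recent.append(list(key))
        if len(self.recent) > self.window:
            oldest = self.recent.popleft()
            self.long_mem = self._ema(self.long_mem, oldest, self.decay_long)
            self.short_mem = self._ema(self.short_mem, oldest, self.decay_short)

    def indexed_keys(self):
        """Return (index, key) pairs with contiguous indices 0..n-1."""
        blocks = [self.sink, self.long_mem, self.short_mem, *self.recent]
        present = [b for b in blocks if b is not None]
        return list(enumerate(present))

cache = ThreeTierCache(window=3)
for t in range(1000):
    cache.push([float(t), 1.0])    # hypothetical 2-d key whose first entry is the frame id

indexed = cache.indexed_keys()
print([i for i, _ in indexed])     # indices stay bounded by the cache size
```

However many frames are pushed, the cache holds at most `window + 3` entries, so the assigned indices stay within a fixed range. In this toy run the short-term memory tracks recent frame ids more closely than the long-term memory does.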