Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation

Tianrui Zhu, Shiyi Zhang, Zhirui Sun, Jingqi Tian, Yansong Tang

TL;DR

The paper tackles the memory–consistency trade-off in long-term real-time video generation by introducing Memorize-and-Generate (MAG), which decouples historical memory compression from frame synthesis via a memory KV-cache and a separate generator. It provides a two-stage training regime and a dedicated MAG-Bench benchmark to evaluate historical retention, achieving near-lossless memory compression (3x) and real-time generation (≈16–21 FPS) with superior historical consistency compared with existing methods. The approach demonstrates strong performance on both short and minute-scale tasks and includes extensive ablations confirming the necessity of memory compression and history-focused training. MAG-Bench enables rigorous assessment of historical scene retention, contributing a practical framework for long-duration video generation research and potential world-model applications.

Abstract

Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce MAG-Bench to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.
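
To make the memory trade-off concrete, the sketch below estimates how a KV cache grows with the number of retained frames, and how much a 3x reduction saves. It is a rough illustration, not the paper's accounting: the function names are hypothetical, the model dimensions in the usage lines are made-up placeholders, and it assumes that the 3x figure corresponds to keeping one frame's cache out of every block of three frames.

```python
def kv_cache_bytes(frames: int, tokens_per_frame: int, layers: int,
                   heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of a KV cache holding `frames` frames (hypothetical helper).

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * frames * tokens_per_frame * layers * heads * head_dim * bytes_per_value


def compressed_frames(frames: int, block_size: int = 3) -> int:
    """Frames kept when one frame per block of `block_size` is retained.

    Assumption: the last, possibly partial, block still keeps one frame,
    so the count is rounded up.
    """
    return -(-frames // block_size)


# Placeholder dimensions chosen only for illustration, not taken from the paper.
dims = dict(tokens_per_frame=1000, layers=30, heads=12, head_dim=128)
full = kv_cache_bytes(frames=900, **dims)
compact = kv_cache_bytes(frames=compressed_frames(900), **dims)
print(f"full history: {full / 1e9:.1f} GB, compressed: {compact / 1e9:.1f} GB")
```

The cache size is linear in the number of retained frames, so full history grows without bound as the video gets longer, while a fixed window caps memory at the cost of forgetting everything outside it. Compressing each block to a single frame keeps the linear growth but divides its slope by the block size.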

Paper Structure

This paper contains 15 sections, 3 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The inference pipeline. MAG performs real-time streaming video generation at 16 FPS on a single GPU. Compared to baselines, MAG achieves $3\times$ memory compression. Simultaneously, MAG is capable of generating scenes beyond the current field of view based on memory, ensuring global historical consistency.
  • Figure 2: The training pipeline. MAG is trained in two stages. In the first stage, we train the memory model to produce a $3\times$ compressed KV cache by retaining the cache of only one frame within each full-attention block. The loss requires the model to reconstruct the pixels of all frames in the block from the compressed cache, and a customized attention mask makes this training efficient and parallel. In the second stage, we train the generator model within the long-video DMD training framework so that it adapts to the compressed cache provided by the frozen memory model.
  • Figure 3: The attention mask of memory model training. We achieve efficient parallel training of the encode-decode process by concatenating noise and clean frame sequences. By masking out the KV cache of other frames within the block, the model is forced to compress information into the target cache.
  • Figure 4: Examples from MAG-Bench. MAG-Bench is a lightweight benchmark comprising 176 videos featuring indoor, outdoor, object, and video game scenes. The benchmark also provides appropriate switch times to guide the model toward correct continuation using a few frames.
  • Figure 5: Visualization of Memory Model reconstruction results. We display two examples featuring texture detail variations and significant camera movement. Visually, the trained Memory Model achieves near-lossless reconstruction of the original pixels under a $3\times$ compression setting.
  • ...and 2 more figures
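
The masking idea described in the Figure 3 caption can be sketched as a boolean attention mask. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the token layout (clean frames first, then noisy frames), the choice of which frame is kept, and the rule that noisy tokens attend to the noisy tokens of their own frame are all hypothetical choices made for the example.

```python
import numpy as np


def memory_training_mask(frames_per_block: int, tokens_per_frame: int,
                         kept_frame: int) -> np.ndarray:
    """Boolean mask where True means the query (row) may attend to the key (column).

    Assumed layout: [clean frame 0..B-1 | noisy frame 0..B-1],
    each frame contributing `tokens_per_frame` tokens.
    """
    b, t = frames_per_block, tokens_per_frame
    n = b * t
    mask = np.zeros((2 * n, 2 * n), dtype=bool)

    # Encode pass: clean tokens use full attention within the block.
    mask[:n, :n] = True

    # Decode pass: noisy tokens see the clean keys/values of the kept frame
    # only; the cache of every other frame in the block stays masked out.
    kept = slice(kept_frame * t, (kept_frame + 1) * t)
    mask[n:, kept] = True

    # Assumption: each noisy frame also attends to its own noisy tokens.
    for f in range(b):
        own = slice(n + f * t, n + (f + 1) * t)
        mask[own, own] = True
    return mask


# Block of 3 frames, 2 tokens per frame, keeping the cache of the last frame.
mask = memory_training_mask(frames_per_block=3, tokens_per_frame=2, kept_frame=2)
print(mask.astype(int))
```

Because the noisy tokens can reach the block's content only through the kept frame's keys and values, reconstructing every frame in the block forces that single cache to carry information about all of them. Concatenating the clean and noisy sequences lets both passes run in one forward call.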