Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation
Tianrui Zhu, Shiyi Zhang, Zhirui Sun, Jingqi Tian, Yansong Tang
TL;DR
The paper tackles the memory–consistency trade-off in long-term real-time video generation by introducing Memorize-and-Generate (MAG), which decouples historical memory compression from frame synthesis: a memory model compresses history into a compact KV cache, and a separate generator synthesizes frames from it. It provides a two-stage training regime and a dedicated benchmark, MAG-Bench, for evaluating historical scene retention. MAG achieves near-lossless 3× memory compression and real-time generation (≈16–21 FPS), with better historical consistency than existing methods. It performs well on both short and minute-scale tasks, and ablations confirm that both memory compression and history-focused training are necessary. Together, MAG and MAG-Bench offer a practical framework for long-duration video generation research and potential world-model applications.
Abstract
Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce MAG-Bench to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.
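To make the trade-off concrete, the following minimal sketch counts how many KV-cache entries are held after a given number of frames under three policies: full history, window attention, and a compressed history. It is an illustration only, not the paper's implementation. The function name `cached_tokens`, the tokens-per-frame figure, the frame count, and the window size are all hypothetical; only the 3× ratio comes from the TL;DR above.

```python
def cached_tokens(num_frames, tokens_per_frame, policy, window=None, ratio=None):
    """Count KV entries held after `num_frames` frames (per layer, per head).

    Hypothetical accounting for illustration; real models differ in detail.
    """
    total = num_frames * tokens_per_frame
    if policy == "full":
        # Keep every past frame: the cache grows linearly with video length.
        return total
    if policy == "window":
        # Keep only the most recent `window` frames; older context is dropped.
        return min(num_frames, window) * tokens_per_frame
    if policy == "compressed":
        # Keep all history, but shrunk by `ratio` (ceiling division).
        return (total + ratio - 1) // ratio
    raise ValueError(f"unknown policy: {policy}")


# Assumed numbers: 300 frames, 100 tokens per frame, a 30-frame window.
full = cached_tokens(300, 100, "full")
windowed = cached_tokens(300, 100, "window", window=30)
compressed = cached_tokens(300, 100, "compressed", ratio=3)
print(full, windowed, compressed)
```

Under these assumed numbers, the windowed cache stays small but no longer covers the first 270 frames, while the compressed cache still spans the whole history at a third of the full cost.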
