Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft

Junchao Huang, Xinting Hu, Boyao Han, Shaoshuai Shi, Zhuotao Tian, Tianyu He, Li Jiang

TL;DR

This paper tackles the challenge of maintaining spatial consistency over long sequences in autoregressive diffusion-based scene generation, using Minecraft as a testbed. It introduces Memory Forcing, a framework that blends temporal memory with a geometry-indexed spatial memory, reinforced by Hybrid Training and Chained Forward Training. A key novelty is Point-to-Frame Retrieval coupled with Incremental 3D Reconstruction, enabling efficient, geometry-aware memory access with constant-time lookups. Empirical results show superior long-term memory, generalization, and generation quality, along with significant improvements in memory efficiency and retrieval speed compared to baselines. The work advances scalable, consistent world modeling for interactive environments under fixed context constraints.

Abstract

Autoregressive video diffusion models have proved effective for world modeling and interactive scene generation, with Minecraft gameplay as a representative application. To faithfully simulate play, a model must generate natural content while exploring new scenes and preserve spatial consistency when revisiting explored areas. Under limited computation budgets, it must compress and exploit historical cues within a finite context window, which exposes a trade-off: Temporal-only memory lacks long-term spatial consistency, whereas adding spatial memory strengthens consistency but may degrade new scene generation quality when the model over-relies on insufficient spatial context. We present Memory Forcing, a learning framework that pairs training protocols with a geometry-indexed spatial memory. Hybrid Training exposes distinct gameplay regimes, guiding the model to rely on temporal memory during exploration and incorporate spatial memory for revisits. Chained Forward Training extends autoregressive training with model rollouts, where chained predictions create larger pose variations and encourage reliance on spatial memory for maintaining consistency. Point-to-Frame Retrieval efficiently retrieves history by mapping currently visible points to their source frames, while Incremental 3D Reconstruction maintains and updates an explicit 3D cache. Extensive experiments demonstrate that Memory Forcing achieves superior long-term spatial consistency and generative quality across diverse environments, while maintaining computational efficiency for extended sequences.
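The retrieval idea in the abstract (mapping currently visible points back to the frames that first observed them, on top of an incrementally updated 3D cache) can be illustrated with a small sketch. This is not the authors' implementation: the `SpatialCache` class, the voxel-hash layout, the vote-counting rule, and the `top_k` parameter are all assumptions made for illustration, and the set of visible points is assumed to be supplied by the caller (for example, from projecting the cache into the current camera).

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


class SpatialCache:
    """Hypothetical geometry-indexed memory: voxel -> frame that first observed it."""

    def __init__(self, voxel_size: float = 1.0) -> None:
        self.voxel_size = voxel_size
        self.source_frame: Dict[Voxel, int] = {}

    def _key(self, p: Point) -> Voxel:
        # Quantize a 3D point to an integer voxel coordinate (assumed indexing scheme).
        s = self.voxel_size
        return (int(p[0] // s), int(p[1] // s), int(p[2] // s))

    def add_frame(self, frame_idx: int, points: Iterable[Point]) -> None:
        # Incremental update: only voxels not already in the cache are inserted,
        # so each voxel keeps the index of the first frame that saw it.
        for p in points:
            self.source_frame.setdefault(self._key(p), frame_idx)

    def retrieve(self, visible_points: Iterable[Point], top_k: int = 2) -> List[int]:
        # Each visible point votes for its source frame via one dictionary lookup
        # (average constant time per point); the most-voted frames are returned.
        votes: Counter = Counter()
        for p in visible_points:
            frame = self.source_frame.get(self._key(p))
            if frame is not None:
                votes[frame] += 1
        return [frame for frame, _ in votes.most_common(top_k)]
```

In this sketch the cost of a query depends on the number of currently visible points rather than on the length of the history, which is one plausible reading of the "constant-time lookups" claim; the paper's actual data structure may differ.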

Paper Structure

This paper contains 18 sections, 10 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Two paradigms of autoregressive video models and their failure cases. (a) Long-term spatial memory models maintain consistency when revisiting areas yet deteriorate in new environments. (b) Temporal memory models excel in new scenes yet lack spatial consistency when revisiting areas.
  • Figure 2: Memory Forcing Pipeline. Our framework combines spatial and temporal memory for video generation. 3D geometry is maintained through streaming reconstruction of key frames along the camera trajectory. During generation, Point-to-Frame Retrieval maps spatial context to historical frames, which are integrated with temporal memory and injected together via memory cross-attention in the DiT backbone. Chained Forward Training creates larger pose variations, encouraging the model to effectively utilize spatial memory for maintaining long-term geometric consistency.
  • Figure 3: Memory capability comparison across different models for maintaining spatial consistency and scene coherence when revisiting previously observed areas.
  • Figure 4: Generalization performance on unseen terrain types (top) and generation performance in new environments (bottom). Our method demonstrates superior visual quality and responsive movement dynamics, with distant scenes progressively becoming clearer as the agent approaches, while baselines show quality degradation, minimal distance variation, or oversimplified distant scenes.
  • Figure 5: Generalization performance on frozen ocean. When generating frozen ocean terrain, WorldMem (Xiao et al., 2025) produces novel scenes resembling the plains terrain from the training set. By contrast, our model preserves the frozen ocean terrain across novel scene generations.
  • ...and 3 more figures