MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Sri Siddarth Chakaravarthy P, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg

Abstract

Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
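
The "lift patches into 3D" and "compose in the queried view" steps can be illustrated with ordinary pinhole-camera geometry. The sketch below is not the authors' implementation: it assumes a depth estimate is available for each patch centre, uses a world-to-camera pose convention, and all names (`Camera`, `lift_to_world`, `project_to_view`, `patch_cell`) are hypothetical. It only shows how a patch stored from one view could be located in the patch grid of a new view.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass(frozen=True)
class Camera:
    """Pinhole camera. rotation/translation map world coords to camera coords (assumed convention)."""
    fx: float
    fy: float
    cx: float
    cy: float
    rotation: Mat3
    translation: Vec3


def lift_to_world(cam: Camera, u: float, v: float, depth: float) -> Vec3:
    """Back-project a patch centre (u, v) with known depth to a 3D world point."""
    # Pixel -> camera frame.
    xc = ((u - cam.cx) / cam.fx * depth, (v - cam.cy) / cam.fy * depth, depth)
    # Camera frame -> world frame: X_w = R^T (X_c - t).
    d = tuple(xc[i] - cam.translation[i] for i in range(3))
    r = cam.rotation
    return tuple(sum(r[k][i] * d[k] for k in range(3)) for i in range(3))


def project_to_view(cam: Camera, point: Vec3) -> Optional[Tuple[float, float]]:
    """Project a world point into a camera; returns None if it lies behind the camera."""
    r = cam.rotation
    xc = tuple(sum(r[i][k] * point[k] for k in range(3)) + cam.translation[i] for i in range(3))
    if xc[2] <= 0:
        return None
    return (cam.fx * xc[0] / xc[2] + cam.cx, cam.fy * xc[1] / xc[2] + cam.cy)


def patch_cell(uv: Tuple[float, float], patch_size: int) -> Tuple[int, int]:
    """Map a pixel location to its (column, row) cell in a regular patch grid."""
    return (int(uv[0] // patch_size), int(uv[1] // patch_size))


IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# Source view at the world origin; target view shifted one unit to the right,
# so its world-to-camera translation is t = -R C = (-1, 0, 0).
source = Camera(500.0, 500.0, 320.0, 240.0, IDENTITY, (0.0, 0.0, 0.0))
target = Camera(500.0, 500.0, 320.0, 240.0, IDENTITY, (-1.0, 0.0, 0.0))

world_point = lift_to_world(source, 320.0, 240.0, depth=2.0)
target_uv = project_to_view(target, world_point)
target_cell = patch_cell(target_uv, patch_size=16)
```

In this toy setup the patch at the source image centre shifts left in the target view, as expected when the camera moves right. A full system would also need visibility handling and the alignment corrections the abstract mentions, which this sketch does not attempt.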

Paper Structure (15 sections, 2 equations, 8 figures, 2 tables)

This paper contains 15 sections, 2 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Left: Memory mechanism comparison and visualization. MosaicMem is a hybrid approach that unifies the strengths of explicit and implicit memory. Right: (A) MosaicMem achieves more accurate camera motion than implicit memory. (B) Compared to explicit memory, MosaicMem enables the generation of text-driven dynamics, whereas content generated by explicit memory remains static.
  • Figure 2: Method overview. Left: MosaicMem lifts patches into 3D, then gathers and stitches them in the target view like a mosaic. Middle: Architecture overview. Camera motion is controlled jointly by MosaicMem retrieval and PRoPE conditioning. Right: Retrieved mosaic patches are flattened and concatenated to the token sequence as conditioning, while alignment errors are resolved by warping.
  • Figure 3: Training-free generation via direct Mosaic Memory injection. Without fine-tuning, the model places retrieved memory conditions at the targeted locations and modestly refines them.
  • Figure 4: (a) MosaicMem generates dynamic objects with temporal consistency, while GEN3C produces static scenes and VWM introduces artifacts despite supporting limited dynamics. (b) MosaicMem preserves prompt adherence to create composite scenes.
  • Figure 5: (a) Comparison of camera-controlled generation between the implicit memory baseline and MosaicMem. Combining PRoPE with MosaicMem improves camera motion control and enables precise spatial memory registration. (b) Without PRoPE, MosaicMem alone struggles with large rotations, where the camera enters previously unseen regions without Mosaic references, leading to significant camera errors.
  • ...and 3 more figures