AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories

Zun Wang, Han Lin, Jaehong Yoon, Jaemin Cho, Yue Zhang, Mohit Bansal

TL;DR

AnchorWeave tackles the challenge of long-horizon world consistency in camera-controlled video generation by replacing global 3D memory with multiple per-frame local geometric memories. It introduces coverage-driven memory retrieval to select complementary local memories and a multi-anchor weaving controller with shared cross-anchor attention and pose-guided fusion to coherently condition generation on several anchors. The approach yields substantial improvements in long-term spatial consistency and visual quality, validated on RealEstate10K and DL3DV with comprehensive ablations and open-domain tests. By updating memories iteratively through an update–retrieve–generate loop, AnchorWeave generalizes to diverse environments and complex camera trajectories, offering a scalable, memory-aware alternative to global 3D fusion.

Abstract

Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.

Paper Structure

This paper contains 26 sections, 1 equation, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: Global 3D reconstruction accumulates cross-view misalignment, introducing artifacts in the reconstructed geometry ((a), middle), which propagate to the generated video as hallucinations ((b), red boxes). In contrast, per-frame local geometry inherently avoids cross-view misalignment and therefore remains clean ((a), right). Conditioning on multiple retrieved local geometric anchors, AnchorWeave maintains strong spatial consistency with the historical frames ((c), white and green boxes).
  • Figure 2: Coverage-driven memory retrieval pipeline. Given a target camera, we first select local memories whose camera FoVs partially overlap with the target camera view to form a candidate memory pool. At each retrieval step, we greedily select the memory that maximizes the newly covered visible area. Points invisible to the target camera are shown in gray. S$i$-M$j$ denotes memory $j$ selected at retrieval step $i$, and the red box indicates the retrieved memory. In S$i$-M$j$, regions already covered by previously retrieved memories are highlighted in green, and only newly covered regions retain their original RGB colors. No green regions appear in S1-M$j$ because nothing has been covered before the first step. Retrieval terminates when the uncovered region reaches 0%, the retrieval budget $K$ is exhausted, or the remaining memory pool is empty. For clarity, coverage is computed on a single frame here; in practice, it is aggregated over multiple frames per chunk.
  • Figure 3: Architecture of the multi-anchor weaving controller. Anchors are encoded and jointly processed by a shared attention block, followed by camera-pose-guided fusion to produce a unified control signal injected into the backbone model. Cameras 1 to $K$ are the retrieved-to-target camera poses for the $K$ anchor videos: each is the relative pose between the camera associated with a retrieved local point cloud and the target camera, and measures their viewpoint proximity. Camera 0 is the relative target camera trajectory.
  • Figure 4: Qualitative comparison with baselines on DL3DV. $K=1$ means one retrieval per chunk. Baseline methods suffer from spatial drift and inconsistent details. In contrast, AnchorWeave (ours) maintains consistency across multiple cases. GT and the historical context are shown for reference. For clarity, representative misaligned regions are highlighted with red boxes for strong baselines (e.g., SEVA).
  • Figure 5: Ablation study. (a) Pose-guided fusion suppresses misaligned anchors and reduces artifacts compared to simple averaging. (b) Joint attention outperforms separate attention, enabling coherent multi-anchor aggregation.
  • ...and 7 more figures
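
The greedy retrieval described in the Figure 2 caption can be sketched as a small set-cover loop. This is an illustrative sketch only, not the authors' implementation: the function name, the representation of visible area as sets of integer region IDs, and the tie-breaking rule are all assumptions made for the example, and the FoV overlap test is approximated by a non-empty set intersection. The early exit when no candidate adds coverage is also an assumption beyond the three termination conditions stated in the caption.

```python
from typing import Dict, List, Set


def greedy_coverage_retrieval(
    target_visible: Set[int],
    memory_pool: Dict[str, Set[int]],
    budget_k: int,
) -> List[str]:
    """Greedily pick memories that add the most newly covered target area.

    target_visible: hypothetical IDs of target-view regions (e.g. pixels).
    memory_pool: hypothetical map from memory name to the region IDs it covers.
    budget_k: retrieval budget K.
    """
    # Candidate pool: keep only memories that overlap the target view,
    # restricted to the regions the target camera can actually see.
    pool = {
        name: cov & target_visible
        for name, cov in memory_pool.items()
        if cov & target_visible
    }
    uncovered = set(target_visible)
    selected: List[str] = []

    # Stop when fully covered, when the budget is exhausted, or when the pool is empty.
    while uncovered and len(selected) < budget_k and pool:
        # Largest newly covered area wins; ties go to the alphabetically first name.
        best = max(sorted(pool), key=lambda name: len(pool[name] & uncovered))
        if not pool[best] & uncovered:
            break  # assumption: no remaining candidate adds coverage
        selected.append(best)
        uncovered -= pool.pop(best)
    return selected


# Toy example: six target regions, four stored memories.
example_pool = {
    "M1": {0, 1, 2, 3},
    "M2": {2, 3, 4},
    "M3": {5},
    "M4": {10, 11},  # no overlap with the target view, so never a candidate
}
picked = greedy_coverage_retrieval(set(range(6)), example_pool, budget_k=3)
```

In this toy setup each step only counts regions that earlier picks have not already covered, which mirrors the green (already covered) versus RGB (newly covered) distinction in the figure. In practice, per the caption, coverage would be aggregated over multiple frames per chunk rather than computed on a single set.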