VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory

Runjia Li, Philip Torr, Andrea Vedaldi, Tomas Jakab

TL;DR

We tackle long-term interactive video generation where a user-guided camera path must yield coherent revisits to the same scene. We introduce Surfel-Indexed View Memory (VMem), a memory module that anchors past views to surfels and retrieves the most relevant observations to condition new views, reducing the number of context frames required. The method combines a surfel-based memory index with an autoregressive view generator (SEVA backbone and LoRA-efficient variant) and demonstrates superior long-term coherence and efficiency on RealEstate10K and Tanks-and-Temples, including cycle trajectories. The results show VMem achieves up to ~12x faster inference with comparable or better quality using far fewer context views, enabling scalable, interactive scene exploration.

Abstract

We propose a novel memory module for building video generators capable of interactively exploring environments. Previous approaches have achieved similar results either by out-painting 2D views of a scene while incrementally reconstructing its 3D geometry, which quickly accumulates errors, or by using video generators with a short context window, which struggle to maintain scene coherence over the long term. To address these limitations, we introduce Surfel-Indexed View Memory (VMem), a memory module that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed. VMem enables efficient retrieval of the most relevant past views when generating new ones. By focusing only on these relevant views, our method produces consistent explorations of imagined environments at a fraction of the computational cost required to use all past views as context. We evaluate our approach on challenging long-term scene synthesis benchmarks and demonstrate superior performance compared to existing methods in maintaining scene coherence and camera control.

Paper Structure

This paper contains 27 sections, 7 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: VMem enables autoregressive scene generation from a single image along user-defined trajectories. The green region shows results with the proposed memory module, maintaining coherence when generating previously seen parts of the scene. The red region, without memory, exhibits degradation highlighted with red ellipses, demonstrating VMem is effective for consistent scene generation.
  • Figure 2: Method. Given target camera viewpoints $\{\mathbf{c}_{T+m}\}_{m=1}^M$, we query our Surfel-Indexed View Memory to retrieve the most relevant $K$ past views $\mathcal{V}^* \subset \mathcal{V}^{(s)}$ where $\mathcal{V}^* = \{v_t\}_{t=1}^K$ as references. Retrieved reference images $\mathbf{x}_t$ along with Plücker embeddings of both reference camera poses $\mathbf{c}_t$ and target camera poses $\{\mathbf{c}_{T+m}\}_{m=1}^M$ are fed into generator $\psi$ to synthesize novel views $\{\mathbf{x}_{T+m}\}_{m=1}^M$. After generation, the surfel-indexed memory is updated $\mathcal{S}^{(s)} \rightarrow \mathcal{S}^{(s+1)}$ by appending new view indices $\{T+m\}_{m=1}^M$ to existing surfels or creating new surfels based on geometry of the generated views. This is repeated autoregressively, enabling long-term consistent generation.
  • Figure 3: Surfel-based memory index. Each surfel stores indices of views that observed it. We color-code each surfel by contributing view indices. This spatial index enables retrieval of relevant past views: when generating a novel view, we identify visible surfels from the target viewpoint and retrieve views that previously observed those same regions, naturally accounting for occlusion.
  • Figure 4: Surfel-Indexed View Memory. Reading procedure renders surfels $\mathcal{S}^{(s)}$ with their attributes, containing past view indices as frame indices. We then select the $K$ most frequent frame indices in the rendered image to retrieve relevant past views from $\mathcal{V}^{(s)}$. Writing procedure estimates geometry of newly generated views $\{\mathbf{x}_{T+m}\}_{m=1}^M$ as surfels and merges them with existing surfels. Frame indices $\{T+m\}_{m=1}^M$ are appended to surfels in these views, and novel views are stored, updating $\mathcal{V}^{(s)} \rightarrow \mathcal{V}^{(s+1)}$ and $\mathcal{S}^{(s)} \rightarrow \mathcal{S}^{(s+1)}$.
  • Figure 5: Long sequences with revisitations. We compare our VMem against a baseline without memory that relies solely on the last $K$ frames for context. Each sequence: input images (left), then generated images at selected frames. Our method (top rows) maintains consistency when revisiting observed regions, while the baseline (bottom rows) shows severe inconsistencies across extended sequences.
  • ...and 1 more figure
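
The reading and writing procedures described in the captions of Figures 3 and 4 can be illustrated with a toy sketch. This is not the authors' implementation. The class and method names (`Surfel`, `SurfelViewMemory`, `read`, `write`) are hypothetical, the merge rule is a plain distance threshold chosen for brevity, and surfel rendering is assumed to happen elsewhere: `read` simply receives the surfel id that an assumed renderer reported for each covered pixel of the target view. Only the voting step, which selects the $K$ most frequent frame indices, follows the Figure 4 caption directly.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Surfel:
    """Hypothetical surfel record: geometry plus the views that observed it."""
    position: tuple
    normal: tuple
    radius: float
    view_indices: list = field(default_factory=list)


class SurfelViewMemory:
    """Toy stand-in for a surfel-indexed view memory (illustrative names)."""

    def __init__(self):
        self.surfels = []  # plays the role of S^(s)
        self.views = {}    # plays the role of V^(s): frame index -> stored view

    def write(self, frame_index, view, new_surfels, merge_dist=0.05):
        """Store a view, then merge each of its surfels into the index.

        Assumption: two surfels are "the same" if their centers lie within
        merge_dist of each other. The paper's actual merge test may differ.
        """
        self.views[frame_index] = view
        for s in new_surfels:
            match = self._find_nearby(s.position, merge_dist)
            if match is None:
                # Unseen region: create a new surfel owned by this frame.
                self.surfels.append(
                    Surfel(s.position, s.normal, s.radius, [frame_index])
                )
            elif frame_index not in match.view_indices:
                # Known region: append the new frame index to the surfel.
                match.view_indices.append(frame_index)

    def _find_nearby(self, position, merge_dist):
        """Linear scan for an existing surfel within merge_dist (O(n))."""
        for existing in self.surfels:
            d2 = sum((a - b) ** 2 for a, b in zip(existing.position, position))
            if d2 <= merge_dist ** 2:
                return existing
        return None

    def read(self, visible_surfel_ids, k):
        """Return the k frame indices that occur most often among the pixels.

        visible_surfel_ids holds one surfel id per covered pixel, as produced
        by an assumed renderer for the target camera. Occluded surfels never
        appear in that list, so they cast no votes.
        """
        votes = Counter()
        for sid in visible_surfel_ids:
            votes.update(self.surfels[sid].view_indices)
        return [idx for idx, _ in votes.most_common(k)]
```

In this sketch a view that covers more pixels of the target image collects more votes, so retrieval favours past views with the largest visible overlap. A practical version would replace the linear scan in `_find_nearby` with a spatial data structure and would likely compare normals as well as positions.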