MemCam: Memory-Augmented Camera Control for Consistent Video Generation

Xinhang Gao, Junlin Guan, Shuhan Luo, Wenzhuo Li, Guanghuan Tan, Jiacheng Wang

Abstract

Interactive video generation has significant potential for scene simulation and video creation. However, existing methods often struggle with maintaining scene consistency during long video generation under dynamic camera control due to limited contextual information. To address this challenge, we propose MemCam, a memory-augmented interactive video generation approach that treats previously generated frames as external memory and leverages them as contextual conditioning to achieve controllable camera viewpoints with high scene consistency. To enable longer and more relevant context, we design a context compression module that encodes memory frames into compact representations and employs co-visibility-based selection to dynamically retrieve the most relevant historical frames, thereby reducing computational overhead while enriching contextual information. Experiments on interactive video generation tasks show that MemCam significantly outperforms existing baseline methods as well as open-source state-of-the-art approaches in terms of scene consistency, particularly in long video scenarios with large camera rotations.
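The abstract describes retrieving the most relevant historical frames via co-visibility-based selection. The paper does not spell out the retrieval rule beyond this, so the following is a minimal sketch under an assumed top-k policy; `select_memory_frames` and the toy score values are illustrative, not the authors' implementation.

```python
import numpy as np

def select_memory_frames(covis_scores, k):
    """Return indices (in temporal order) of the k historical frames
    whose co-visibility with the predicted view is highest.
    Hypothetical helper; the paper's exact selection rule is unspecified."""
    k = min(k, len(covis_scores))
    order = np.argsort(covis_scores)[::-1]  # highest co-visibility first
    return np.sort(order[:k])               # restore temporal order

scores = np.array([0.10, 0.80, 0.30, 0.90, 0.05])  # toy co-visibility scores
print(select_memory_frames(scores, 2).tolist())    # → [1, 3]
```

The compact representations of the selected frames would then be concatenated with the noisy prediction sequence as context, as the abstract describes.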

Paper Structure

This paper contains 14 sections, 2 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Methodology. (Left) Overview of MemCam: the Context Compressor encodes historical frames selected via co-visibility into compact representations, which are concatenated with the noisy prediction sequence and fed into the DiT Block. (Right) Illustration of co-visibility computation between predicted and historical camera FOVs.
  • Figure 2: Qualitative Comparison Results. (a) and (b) are evaluated on the Context-as-Memory dataset, and (c) and (d) on RealEstate10K. MemCam achieves superior performance in scene memory retention and overall generation quality. In contrast, other methods exhibit varying degrees of scene inconsistency due to insufficient utilization of contextual information.
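Figure 1 (right) illustrates co-visibility computed between the predicted and historical camera FOVs. The paper's exact formula is not given here, so the sketch below estimates co-visibility under assumed pinhole-camera geometry: sample scene points, test which fall inside each camera's frustum, and take the fraction of the predicted view's visible points that the historical view also covers. All names and parameters are illustrative.

```python
import numpy as np

def visible(points, R, t, fov_deg):
    """Boolean mask of 3-D world points inside a pinhole camera's FOV.
    (R, t) map world to camera coordinates: X_cam = R @ X + t."""
    cam = (R @ points.T).T + t
    z = cam[:, 2]
    half = np.tan(np.radians(fov_deg) / 2.0)  # half-FOV extent at unit depth
    in_front = z > 1e-6
    with np.errstate(divide="ignore", invalid="ignore"):
        in_x = np.abs(cam[:, 0] / z) < half
        in_y = np.abs(cam[:, 1] / z) < half
    return in_front & in_x & in_y

def covisibility(points, cam_a, cam_b, fov_deg=60.0):
    """Fraction of points seen by camera A that camera B also sees.
    Monte-Carlo stand-in for the FOV-overlap computation in Figure 1."""
    va = visible(points, *cam_a, fov_deg)
    vb = visible(points, *cam_b, fov_deg)
    if va.sum() == 0:
        return 0.0
    return float((va & vb).sum() / va.sum())

# Demo: identical cameras see exactly the same points.
rng = np.random.default_rng(0)
pts = rng.uniform([-0.5, -0.5, 1.0], [0.5, 0.5, 5.0], size=(500, 3))
identity_cam = (np.eye(3), np.zeros(3))
print(covisibility(pts, identity_cam, identity_cam))  # → 1.0
```

A historical frame facing away from the predicted view scores near zero under this measure, which is why large camera rotations make relevant-context retrieval, and hence scene consistency, harder.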