Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context

JiaKui Hu, Jialun Liu, Liying Yang, Xinliang Zhang, Kaiwen Li, Shuang Zeng, Yuanwei Li, Haibin Huang, Chi Zhang, Yanye Lu

TL;DR

A camera gated attention module is developed to help the model leverage camera poses effectively, and the results show that the method outperforms previous approaches in maintaining scene consistency and camera control.

Abstract

Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely either on video generation models with external memory for consistency, or on iterative 3D reconstruction and inpainting, which accumulate errors during inference due to incorrect intermediate outputs, non-differentiable processes, and separate models. To overcome these limitations, we introduce "geometry-as-context". Using an autoregressive camera-controlled video generation model, it iterates over two steps: (1) estimating the geometry of the current view, which is needed for 3D reconstruction, and (2) simulating and restoring novel-view images rendered from the 3D scene. Under this multi-task framework, we develop the camera gated attention module to help the model leverage camera poses effectively. During training, text contexts determine whether geometry or RGB images should be generated. To ensure that the model can generate RGB-only outputs during inference, the geometry context is randomly dropped from the interleaved text-image-geometry training sequence. The method has been tested on scene video generation with one-direction and forth-and-back trajectories. The results show its superiority over previous approaches in maintaining scene consistency and camera control.
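
The random dropping of geometry context described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Token` type, the `<rgb>` and `<geometry>` text tags, the per-frame dropping rule, and the `drop_prob` parameter are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    kind: str     # "text", "rgb", or "geometry"
    frame: int    # index of the view this token belongs to
    payload: str  # placeholder standing in for real latents or text


def build_sequence(num_frames: int, drop_prob: float,
                   rng: Optional[random.Random] = None) -> List[Token]:
    """Build an interleaved text-image-geometry sequence (hypothetical layout).

    Each view contributes a text tag plus its RGB image. With probability
    1 - drop_prob it also contributes a text tag plus its geometry, so the
    model sees both geometry-conditioned and RGB-only contexts in training.
    """
    rng = rng or random.Random()
    seq: List[Token] = []
    for t in range(num_frames):
        # The text tag tells the model which modality comes next (assumed format).
        seq.append(Token("text", t, "<rgb>"))
        seq.append(Token("rgb", t, f"rgb_{t}"))
        # Assumption: geometry is dropped independently for each view.
        if rng.random() >= drop_prob:
            seq.append(Token("text", t, "<geometry>"))
            seq.append(Token("geometry", t, f"geometry_{t}"))
    return seq


if __name__ == "__main__":
    for tok in build_sequence(num_frames=3, drop_prob=0.5, rng=random.Random(0)):
        print(tok.frame, tok.kind, tok.payload)
```

Setting `drop_prob=1.0` in this sketch yields an RGB-only sequence, which corresponds to the inference mode the abstract mentions. Whether the dropping in the paper is applied per view or per sequence is not stated, so the per-view choice here is only one possible reading.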

Paper Structure

This paper contains 19 sections, 9 equations, 9 figures, 6 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Teaser demonstration. We introduce Geometry-as-Context (GaC), a framework that incorporates explicit 3D information into reconstruction-based scene video generation. GaC mitigates cumulative errors from non-differentiable reconstruction and non-end-to-end training pipelines. Furthermore, GaC enhances the 3D consistency and long-term 3D memory of generative video models. We showcase GaC under four settings: outdoor, indoor, in-the-wild, and forth-and-back camera trajectory. GaC maintains consistency under cyclic motion: even when an object (e.g., a computer) disappears in the 32nd frame of the last row, it is faithfully restored in later frames.
  • Figure 2: Reconstruction-based scene video generation (a) vs. our geometry-as-context (GaC) (b). Reconstruction-based scene video generation uses non-differentiable operators in reconstruction, which tend to worsen cumulative errors caused by inaccurate geometry estimates or image inpainting. In contrast, GaC replaces these operations with camera-controllable generation, turning reconstruction-based scene video generation into an autoregressive video generation framework with a single DiT. This effectively reduces cumulative errors caused by non-differentiable reconstruction and non-end-to-end training.
  • Figure 3: Detailed architecture of geometry-as-context.
  • Figure 4: Qualitative results of scene video generation from a single view. Compared to the baselines, our model generates more consistent novel views. Each image shows the 20th frame of the generated video clip, except for the input image. For a clearer visualization, please zoom in.
  • Figure 5: GaC's results on indoor scenes.
  • ...and 4 more figures