StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation

Shangjin Zhai, Zhichao Ye, Jialin Liu, Weijian Xie, Jiaqi Hu, Zhen Peng, Hua Xue, Danpeng Chen, Xiaomeng Wang, Lei Yang, Nan Wang, Haomin Liu, Guofeng Zhang

TL;DR

StarGen tackles long-range scene generation under compute limits by introducing a spatiotemporal autoregression framework that conditions each video clip on temporally overlapping frames and spatially adjacent views. A Large Reconstruction Model regresses depth and features from the spatial conditioning views, producing latent representations that, when combined with a temporal cue and processed through a pre-trained video diffusion model with ControlNet, yield pose-controlled, coherent sequences. The approach supports sparse view interpolation, perpetual view generation, and layout-conditioned city generation, with quantitative and qualitative results showing improved scalability, fidelity, and pose accuracy over state-of-the-art methods. This framework enables scalable, controllable long-range scene synthesis and sets the stage for more robust 3D-consistent content generation from sparse inputs, while highlighting areas for future work in loop handling and 3D reconstruction of generated content.

Abstract

Recent advances in large reconstruction and generative models have significantly improved scene reconstruction and novel view generation. However, due to compute limitations, each inference with these large models is confined to a small area, making long-range consistent scene generation challenging. To address this, we propose StarGen, a novel framework that employs a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. The generation of each video clip is conditioned on the 3D warping of spatially adjacent images and the temporally overlapping image from previously generated clips, improving spatiotemporal consistency in long-range scene generation with precise pose control. The spatiotemporal condition is compatible with various input conditions, facilitating diverse tasks, including sparse view interpolation, perpetual view generation, and layout-conditioned city generation. Quantitative and qualitative evaluations demonstrate StarGen's superior scalability, fidelity, and pose accuracy compared to state-of-the-art methods. Project page: https://zju3dv.github.io/StarGen.
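
The autoregressive scheme described in the abstract can be illustrated with a small scheduling sketch. It only plans which frames each clip would cover and which earlier frames would serve as conditions; it does not run any model. Everything named here is a hypothetical illustration rather than the authors' code: the function names, the one-frame overlap, the choice of two spatial views, the nearest-camera-position selection rule, and the assumption that frame 0 is a given input image. The sketch also keeps every generated frame as a candidate spatial condition, whereas the paper represents the scene as sparsely sampled posed images.

```python
from math import dist


def plan_windows(num_frames, window, overlap=1):
    """Split frame indices into sliding windows that share `overlap` frames."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    windows, start = [], 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            return windows
        # The next clip starts on the last `overlap` frames of this one.
        start = end - overlap


def nearest_views(target_pos, generated, k=2):
    """Return indices of the k already-generated views closest to target_pos."""
    return sorted(generated, key=lambda i: dist(generated[i], target_pos))[:k]


def plan_generation(positions, window, overlap=1, k=2):
    """Plan the temporal and spatial conditions for each clip (hypothetical rule)."""
    if overlap < 1:
        raise ValueError("at least one overlapping frame is needed as temporal condition")
    generated = {0: positions[0]}  # assumption: frame 0 is the given input image
    plan = []
    for start, end in plan_windows(len(positions), window, overlap):
        center = positions[(start + end - 1) // 2]
        plan.append({
            "frames": (start, end),       # half-open range of frame indices
            "temporal": start,            # frame shared with the previous clip
            "spatial": nearest_views(center, generated, k),
        })
        for i in range(start, end):       # the clip's frames become available
            generated[i] = positions[i]
    return plan


# Ten cameras moving along the x-axis, clips of four frames each.
positions = [(float(i), 0.0, 0.0) for i in range(10)]
for step in plan_generation(positions, window=4):
    print(step)
```

Each planned clip reuses one frame from the previous clip as its temporal condition, which is what lets a fixed-size model cover an arbitrarily long trajectory one window at a time.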

Paper Structure (13 sections, 7 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Overview of the proposed method: (a) We introduce a spatiotemporal autoregression framework for long-range scene generation. The generated scene is represented as a set of sparsely sampled posed images. The generation of the current sliding window of images (blue dotted box) is conditioned on spatially adjacent images (green frustums) and a temporally overlapping image (blue solid box). (b) The spatial conditioning images are processed by a large reconstruction model, which extracts 3D information and renders the reconstructed latent features to each novel view. These spatial features, together with the temporal conditioning image, condition the generation of the current window through a video diffusion model and a ControlNet. (c) The framework is used to implement three downstream tasks: sparse view interpolation, perpetual view generation, and layout-conditioned city generation.
  • Figure 2: Spatiotemporal-Conditioned Video Generation. Given two posed images as spatial conditions (green dotted box on the left), the large reconstruction model regresses their depth maps and feature maps. The two feature maps $\mathbf{F}_{i_1}^\text{spat}$ and $\mathbf{F}_{i_2}^\text{spat}$ are rendered into novel views $\mathbf{F}^\text{nov}$ and temporally compressed to the latent space of CogVideoX, resulting in $\mathbf{z}^\text{spat}$. Simultaneously, the temporal conditioning image (blue dotted box on the right) is encoded to $\mathbf{z}^\text{temp}_k$ to replace the corresponding latent in $\mathbf{z}^\text{spat}$, resulting in the spatiotemporal condition $\mathbf{z}^\text{st}$, which conditions the generation of CogVideoX through a ControlNet.
  • Figure 3: Qualitative comparison of sparse view interpolation on the RealEstate-10K (Zhou et al., 2018) test set in challenging scenarios where the two input images have minimal or no overlap. In these cases, our method outperforms the other methods. We encourage readers to watch our supplementary video to better appreciate the differences.
  • Figure 4: Scalability comparison of perpetual view generation on long-range videos from the RealEstate-10K (Zhou et al., 2018) test set. For a fair FID comparison across different target frame counts, for each target frame count $N$ we generate $5\text{K}/N$ results per method, which keeps the total number of evaluated frames at roughly 5K. Our method significantly outperforms existing methods in both fidelity (a) and pose accuracy (b)(c).
  • Figure 5: Qualitative comparison of perpetual view generation on long-range videos from the RealEstate-10K (Zhou et al., 2018) test set. While ViewCrafter degrades significantly as the generated video becomes longer, our method generates reasonable content throughout the entire sequence. We encourage readers to watch our supplementary video to better appreciate the differences.
  • ...and 1 more figure
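
The latent replacement step in the Figure 2 caption, where $\mathbf{z}^\text{temp}_k$ replaces the corresponding latent in $\mathbf{z}^\text{spat}$ to form $\mathbf{z}^\text{st}$, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function name is hypothetical, latents are stood in for by plain Python values rather than tensors, and the index `k` is taken to already refer to a position in the temporally compressed latent sequence.

```python
def build_spatiotemporal_condition(z_spat, z_temp, k):
    """Return a copy of z_spat in which the latent at index k is replaced by z_temp.

    z_spat: sequence of per-step spatial latents (placeholders for tensors).
    z_temp: latent encoded from the temporal conditioning image.
    k:      latent index that corresponds to the temporal conditioning image.
    """
    if not 0 <= k < len(z_spat):
        raise IndexError("k must index a latent in z_spat")
    z_st = list(z_spat)  # copy so the spatial latents stay untouched
    z_st[k] = z_temp
    return z_st


# Placeholder latents: strings stand in for latent tensors.
z_spat = ["spat_0", "spat_1", "spat_2"]
z_st = build_spatiotemporal_condition(z_spat, "temp_0", k=0)
print(z_st)
```

Swapping in the encoded temporal image at its own position, rather than concatenating it as an extra input, keeps the condition the same length as the sequence of rendered spatial features.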