Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

Boyang Deng, Richard Tucker, Zhengqi Li, Leonidas Guibas, Noah Snavely, Gordon Wetzstein

TL;DR

Streetscapes tackles the challenge of generating long-range, city-scale street views that remain visually coherent over extended trajectories. It combines layout-conditioned two-frame diffusion with an autoregressive temporal imputation mechanism, enabling scalable, multi-view consistency without retraining for long sequences. The system is trained on Google Street View data paired with map layouts (street maps and height maps) and supports text-driven style control and geographic style transfer. Across long-range and perpetual generation tasks, Streetscapes achieves higher fidelity and stability than baselines, with practical applications for virtual navigation, VR/AR, and 3D reconstruction.

Abstract

We present a method for generating Streetscapes: long sequences of views through an on-the-fly synthesized city-scale scene. Our generation is conditioned by language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data (posed imagery from Google Street View, along with contextual map data), which allows users to generate city views conditioned on any desired city layout, with controllable camera poses. Please see more results at our project page at https://boyangdeng.com/streetscapes.

Paper Structure (34 sections, 3 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Layout-conditioned Scene Generation. Using the input scene layout (overhead street map and height map), we render two geometry buffers (G-buffers), $G^{(i)}$ and $G^{(i+1)}$, that contain semantic labels encoded in an RGB image as well as image-space disparity maps and height maps, for camera poses $C^{(i)}$ and $C^{(i+1)}$. These G-buffers condition a motion-aware latent diffusion model that generates a pair of images. Orange and purple boxes illustrate spatial and temporal layers, respectively.
  • Figure 2: Autoregressive Video Diffusion. The Streetscapes system generates a sequence of consistent frames along a desired camera trajectory. Consistency is achieved by generating the first $2$ frames jointly using parallel denoising, then generating each subsequent frame via temporal imputation, guided by the previous frame in an autoregressive manner. Both procedures use the same model, but with different reverse diffusion formulations.
  • Figure 3: Long-range Consistent Street View Generation. We compare Streetscapes to a state-of-the-art street view generation approach, InfiniCity [lin2023infinicity], on the task of generating consistent views on long paths through large-scale street scenes. In this task, we let Streetscapes generate consistent views autoregressively following various camera tracks. We find that Streetscapes consistently generates higher-fidelity street views that are noticeably more realistic than InfiniCity's results.
  • Figure 4: Long-range Street Cruise. While alternative autoregressive generation methods suffer from severe quality degradation after $32$ frames (see Fig. \ref{fig:autoreg_syn} and Tab. \ref{tbl:autoreg_syn}), our Streetscapes system can generate long street cruises without noticeable drift in quality for $100$ frames along paths of over $170$ meters spanning multiple city blocks. Note how our results are consistent with the specified scene layout and camera poses (illustrated on the maps). In addition, in contrast to prior methods that support either forward-only [liu2021infinite] or backward-only [SceneScape] camera motion, our system allows for flexible camera control wherein the user can freely move and turn the camera.
  • Figure 5: Perpetual Street View Generation. We compare Streetscapes with state-of-the-art autoregressive view generation methods, including InfiniteNature-Zero [li2022_infinite_nature_zero] (using either monocular depth, i.e., InfNat0-mono, or proxy depth, i.e., InfNat0-proxy), DiffDreamer [cai2022diffdreamer], and an autoregressive variant of Zero123 [liu2023zero], i.e., Zero123-A. In this task, we are given an initial input image and a camera track. Each method aims to generate a consistent street view following the camera track autoregressively. We pick generation step-$1$ to demonstrate degradation-free generation quality, as well as step-$20$ and step-$40$ for long-range generation quality. Note how the results of our Streetscapes method remain highly realistic, while those of other methods degrade significantly.
  • ...and 5 more figures
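
The control flow described in the Figure 2 caption (first two frames denoised jointly, then each later frame generated with the previous frame held as a known input) can be sketched in a few lines. This is a toy illustration only, not the authors' implementation: `denoise_step`, `add_noise`, `NUM_STEPS`, and the list-of-floats "frames" are all hypothetical stand-ins, and the imputation shown here (overwriting the known slot with a re-noised copy of the previous frame at every reverse step) is one common formulation that is assumed rather than taken from the paper.

```python
import random

NUM_STEPS = 10  # assumed number of reverse-diffusion steps (toy value)


def add_noise(frame, t, rng):
    """Toy forward process: blend a clean frame with Gaussian noise at level t in [0, 1]."""
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in frame]


def denoise_step(pair, gbuffers, t):
    """Hypothetical stand-in for one reverse step of a two-frame model.

    Each frame is pulled toward its own G-buffer and toward the other frame;
    the cross-frame term loosely plays the role of the temporal layers.
    The noise level t is unused in this toy update.
    """
    a, b = pair
    ga, gb = gbuffers
    new_a = [(x + g + y) / 3.0 for x, g, y in zip(a, ga, b)]
    new_b = [(y + g + x) / 3.0 for y, g, x in zip(b, gb, a)]
    return [new_a, new_b]


def generate_first_pair(g0, g1, rng):
    """Jointly denoise two frames, both starting from pure noise."""
    pair = [[rng.gauss(0.0, 1.0) for _ in g0], [rng.gauss(0.0, 1.0) for _ in g1]]
    for step in range(NUM_STEPS, 0, -1):
        pair = denoise_step(pair, [g0, g1], step / NUM_STEPS)
    return pair


def generate_next(prev_frame, g_prev, g_next, rng):
    """Generate one new frame while the previous frame fills the known slot."""
    new = [rng.gauss(0.0, 1.0) for _ in g_next]
    for step in range(NUM_STEPS, 0, -1):
        t = step / NUM_STEPS
        # Assumed imputation rule: overwrite the known slot with a re-noised
        # copy of the already generated frame at the current noise level.
        known = add_noise(prev_frame, t, rng)
        _, new = denoise_step([known, new], [g_prev, g_next], t)
    return new


def generate_sequence(gbuffers, seed=0):
    """Return one toy frame per G-buffer, generated autoregressively."""
    if len(gbuffers) < 2:
        raise ValueError("need at least two G-buffers for the initial pair")
    rng = random.Random(seed)
    frames = generate_first_pair(gbuffers[0], gbuffers[1], rng)
    for i in range(2, len(gbuffers)):
        frames.append(generate_next(frames[-1], gbuffers[i - 1], gbuffers[i], rng))
    return frames


# Usage: six hypothetical G-buffers (one per camera pose), four values each.
layout = [[float(i)] * 4 for i in range(6)]
frames = generate_sequence(layout)
print(len(frames))  # one frame per G-buffer
```

The point of the sketch is that the same two-frame step function serves both phases; only the handling of the first slot changes, so the sequence length is not tied to the model's training window.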