Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion
Boyang Deng, Richard Tucker, Zhengqi Li, Leonidas Guibas, Noah Snavely, Gordon Wetzstein
TL;DR
Streetscapes tackles the challenge of generating long-range, city-scale street views that remain visually coherent over extended trajectories. It combines layout-conditioned two-frame diffusion with an autoregressive temporal imputation mechanism, enabling scalable, multi-view consistency without retraining for long sequences. The system is trained on Google Street View data paired with map layouts (street maps and height maps) and supports text-driven style control and geographic style transfer. Across long-range and perpetual generation tasks, Streetscapes achieves higher fidelity and stability than baselines, with practical applications for virtual navigation, VR/AR, and 3D reconstruction.
Abstract
We present a method for generating Streetscapes: long sequences of views through an on-the-fly synthesized city-scale scene. Our generation is conditioned by language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data (posed imagery from Google Street View, along with contextual map data), which allows users to generate city views conditioned on any desired city layout, with controllable camera poses. Please see more results at our project page at https://boyangdeng.com/streetscapes.
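
To make the autoregressive idea concrete, the following is a minimal toy sketch, not the Streetscapes implementation. It assumes a two-frame window in which the first slot is repeatedly overwritten with a noised copy of the previously generated frame during denoising, which is one common way to realize imputation in diffusion models. The names `toy_denoiser` and `generate`, the linear noise schedule, and the use of small NumPy arrays in place of images and layouts are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np


def toy_denoiser(window, layout_pair, t):
    """Hypothetical stand-in for a layout-conditioned two-frame denoiser.

    Each frame is pulled halfway toward the average of its own layout and
    the other frame in the window, so the two frames influence each other.
    A real system would use a trained network; `t` is unused in this toy.
    """
    target = 0.5 * (layout_pair + window[::-1])
    return window + 0.5 * (target - window)


def generate(layouts, first_frame, num_steps=10, seed=0):
    """Generate one frame per layout, autoregressively, two frames at a time."""
    rng = np.random.default_rng(seed)
    frames = [first_frame]
    for k in range(1, len(layouts)):
        layout_pair = np.stack([layouts[k - 1], layouts[k]])
        # Start the two-frame window from pure noise.
        window = rng.standard_normal(layout_pair.shape)
        for step in range(num_steps, 0, -1):
            window = toy_denoiser(window, layout_pair, step)
            # Imputation (assumed form): overwrite slot 0 with the known
            # previous frame, noised to the level remaining after this step.
            noise_level = (step - 1) / num_steps
            noise = rng.standard_normal(first_frame.shape)
            window[0] = frames[-1] + noise_level * noise
        # Slot 1 is the newly generated frame; it seeds the next window.
        frames.append(window[1].copy())
    return np.stack(frames)


if __name__ == "__main__":
    layouts = np.zeros((4, 2, 2))   # four toy "layouts"
    first = np.ones((2, 2))         # toy starting frame
    video = generate(layouts, first)
    print(video.shape)
```

Because each new frame is produced from a window anchored on the previous output, the loop can run for any number of layouts without changing the denoiser, which is the scaling property the abstract describes. The toy does not model drift or image quality.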
