Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion

Zuoyue Li; Zhenqiang Li; Zhaopeng Cui; Marc Pollefeys; Martin R. Oswald

TL;DR

Sat2Scene tackles cross-view, city-scale urban scene generation from satellite imagery by integrating diffusion models into 3D sparse point-cloud representations and neural rendering. The pipeline first colorizes a foreground point cloud with a 3D diffusion model, producing colors $C \in [0,1]^{N \times 3}$ on geometry $P \in \mathbb{R}^{N \times 3}$, and generates a background panorama $B \in \mathbb{R}^{H_B \times W_B \times 3}$ with a 2D diffusion model. A feed-forward feature extraction $F = E(P, C)$ and volume rendering then produce arbitrary views. Key contributions include the first combination of diffusion with 3D sparse representations for direct satellite-to-scene generation, a point-anchored feature extraction strategy, and a neural rendering pipeline that yields photorealistic, temporally consistent street-view videos. Experiments on HoliCity show state-of-the-art temporal consistency and image quality, and cross-view generalization to OmniCity demonstrates robustness to new city-scale data. The sparse representation is also highlighted as memory-efficient and scalable for outdoor scene synthesis.
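The volume-rendering step mentioned above can be illustrated with the standard front-to-back compositing rule used in neural radiance fields. This is a minimal sketch, not the paper's renderer: the function name `composite_ray`, the per-sample inputs, and the choice to fill the remaining transmittance with a background color (standing in for a lookup into the panorama $B$) are all assumptions made for illustration.

```python
import math


def composite_ray(densities, colors, deltas, background):
    """Alpha-composite samples along one ray, front to back.

    densities:  non-negative density per sample (hypothetical network output)
    colors:     one RGB tuple per sample, values in [0, 1]
    deltas:     distance covered by each sample along the ray
    background: RGB tuple used for light that passes every sample
                (assumed here to come from the background panorama)
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weight = transmittance * alpha          # contribution to the pixel
        for k in range(3):
            pixel[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha
    # Whatever light is left is attributed to the background.
    for k in range(3):
        pixel[k] += transmittance * background[k]
    return pixel, transmittance
```

A ray that hits no foreground geometry (all densities zero) returns the background color unchanged, while a ray that hits an opaque sample returns that sample's color. This matches the intended split between point-anchored foreground and panorama background, under the assumptions stated above.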

Abstract

Directly generating scenes from satellite imagery offers exciting possibilities for integration into applications like games and map services. However, challenges arise from significant view changes and scene scale. Previous efforts mainly focused on image or video generation and did not explore how well scene generation adapts to arbitrary views. Existing 3D generation works either operate at the object level or struggle to utilize the geometry obtained from satellite imagery. To overcome these limitations, we propose a novel architecture for direct 3D scene generation by introducing diffusion models into 3D sparse representations and combining them with neural rendering techniques. Specifically, our approach first generates texture colors at the point level for a given geometry using a 3D diffusion model; the colored geometry is then transformed into a scene representation in a feed-forward manner. The representation can be used to render arbitrary views that excel in both single-frame quality and inter-frame consistency. Experiments on two city-scale datasets show that our model is proficient at generating photo-realistic street-view image sequences and cross-view urban scenes from satellite imagery.
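To make "generating texture colors at the point level with a diffusion model" more concrete, the sketch below applies the standard DDPM forward (noising) process to per-point RGB values. The abstract does not state the noise schedule or the color scaling, so the linear beta schedule, the rescaling of colors from $[0,1]$ to $[-1,1]$, and the function names are assumptions made for illustration. The denoising network itself (a 3D model with sparse convolutions, per the paper) is not shown.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    The schedule values are common DDPM defaults, assumed here.
    """
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_point_colors(colors, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps.

    colors: list of RGB tuples in [0, 1], one per point (the clean x_0)
    rng:    a random.Random instance, so results are reproducible
    """
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    noisy = []
    for rgb in colors:
        # Rescale each channel to [-1, 1] before adding Gaussian noise.
        noisy.append(tuple(
            signal * (2.0 * c - 1.0) + sigma * rng.gauss(0.0, 1.0)
            for c in rgb
        ))
    return noisy


if __name__ == "__main__":
    bars = linear_alpha_bars(1000)
    points = [(0.8, 0.7, 0.6), (0.1, 0.2, 0.3)]
    print(noise_point_colors(points, bars[500], random.Random(0)))
```

During training, a denoiser would see the noisy colors together with the fixed point positions and learn to recover the clean colors. Only the colors are noised; the geometry stays fixed as conditioning, which matches the abstract's "for a given geometry".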

Paper Structure (13 sections, 3 equations, 6 figures, 2 tables)

Introduction
Related work
Method
Generation
Rendering
Implementation details
Experiments
Configuration
Quantitative comparison
Qualitative comparison
Ablation study
Generalization
Conclusion

Figures (6)

Figure 1: Urban scenes generated by Sat2Scene. From a single satellite image covering urban streets, Sat2Scene is able to generate videos with photorealistic and consistent textures across different views.
Figure 2: Pipeline overview of our method. The full pipeline generates the scene representation and renders street views from satellite-inferred geometry in three steps. The generation step initializes colors for the foreground point cloud using a 3D diffusion model with sparse convolutions and synthesizes the background panorama with a 2D diffusion model. The feature extraction step extracts scene features that are tightly anchored to the point cloud. The final rendering step produces images from arbitrary views through neural rendering.
Figure 3: Visualization of the point resampling scheme. Yellow / purple color means high / low point confidence.
Figure 4: Qualitative baseline comparison on the HoliCity dataset. Our method produces higher-quality video with better temporal consistency compared with the baselines.
Figure 5: Qualitative ablation study. We present exemplary qualitative results for various ablations of our method. With our full method, the rendered images visibly contain more detail and the depths are recovered better. The second line of each example shows the depth in pseudo-colors, except the bottom-left ones, which are ground-truth (GT) images.
...and 1 more figure
