EarthGen: Generating the World from Top-Down Views

Ansh Sharma, Albert Xiao, Praneet Rathi, Rohit Kundu, Albert Zhai, Yuan Shen, Shenlong Wang

TL;DR

EarthGen is a scalable framework for generating infinite-size, high-resolution Earth imagery. It combines a base latent diffusion model (LDM) with cascaded, scale-aware super-resolution (SR) modules and tiling based on a Mixture of Diffusers. The approach produces coherent, ultra-high-resolution terrain across large geographies, and it supports interactive gigapixel exploration and downstream 3D scene generation. Key innovations include negative text conditioning to curb low-quality outputs, Mixture of Diffusers tiling to enforce continuity across tiles, and a training pipeline that jointly tunes a VAE, the base LDM, and the SR cascade on Bing Maps data. The results show substantial improvements in quality and realism over state-of-the-art SR baselines. Practical applications include controllable world design, environmental analytics, and asset creation, and the work is open-sourced for broad adoption.

Abstract

In this work, we present a novel method for extensive multi-scale generative terrain modeling. At the core of our model is a cascade of super-resolution diffusion models that can be combined to produce consistent images across multiple resolutions. Pairing this concept with a tiled generation method yields a scalable system that can generate thousands of square kilometers of realistic Earth surfaces at high resolution. We evaluate our method on a dataset collected from Bing Maps and show that it outperforms super-resolution baselines on the extreme super-resolution task of 1024x zoom. We also demonstrate its ability to create diverse and coherent scenes via an interactive gigapixel-scale generated map. Finally, we demonstrate how our system can be extended to enable novel content creation applications, including controllable world generation and 3D scene generation.
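
The tiled generation idea in the abstract can be illustrated with a minimal sketch of weighted tile blending, the general mechanism behind Mixture-of-Diffusers-style tiling: overlapping tiles are predicted independently and averaged with smooth weights so that seams fade out. This is an illustration under stated assumptions, not the authors' implementation. The function names, the Gaussian window, the tile size, and the stride are all hypothetical, and the sketch works on a 1D strip for brevity (a 2D version would use the outer product of two such windows).

```python
import math

def gaussian_window(size, sigma_frac=0.3):
    """Per-position blend weights that peak at the tile center (assumed shape)."""
    center = (size - 1) / 2.0
    sigma = sigma_frac * size
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(size)]

def tile_starts(length, tile, stride):
    """Start offsets of overlapping tiles that together cover [0, length)."""
    if length < tile:
        raise ValueError("canvas must be at least one tile long")
    if stride > tile:
        raise ValueError("stride larger than tile would leave gaps")
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # extra tile so the far edge is covered
    return starts

def blend_tiles_1d(predict_tile, length, tile=64, stride=48):
    """Weighted average of independent per-tile predictions along a strip.

    predict_tile(start, size) is a hypothetical stand-in for one model call
    (for example, one denoising prediction) on the window starting at `start`.
    """
    weights = gaussian_window(tile)
    acc = [0.0] * length   # weighted sum of predictions
    norm = [0.0] * length  # sum of weights, for normalization
    for start in tile_starts(length, tile, stride):
        pred = predict_tile(start, tile)
        for i in range(tile):
            acc[start + i] += weights[i] * pred[i]
            norm[start + i] += weights[i]
    return [a / n for a, n in zip(acc, norm)]

if __name__ == "__main__":
    # Two tiles that disagree (0.0 vs 1.0) blend smoothly inside their overlap.
    strip = blend_tiles_1d(lambda start, size: [float(start > 0)] * size, 112)
    print([round(v, 2) for v in strip[46:66]])
```

When neighboring tiles agree, the weighted average returns their shared value unchanged; when they disagree, the result moves gradually from one prediction to the other across the overlap instead of jumping at a hard tile boundary.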

Paper Structure

This paper contains 42 sections, 12 equations, 11 figures, and 9 tables.

Figures (11)

  • Figure 1: Dataset Distributions: Zoom 20 data (15 cm/px) is mostly available only over America and Western Europe while Zoom 19 (30 cm/px) is available over the rest of the land in the world. High-resolution ocean tiles are typically unavailable.
  • Figure 2: Our pipeline consists of a base layer module, cascaded super-resolution modules, and Mixture of Diffusers tiling.
  • Figure 3: Pairwise win-rates between models based on user study. Each cell represents the row's win rate against the column.
  • Figure 4: Text conditioned base layer generation for input labels of "lake", "city", and "mountains" from left to right.
  • Figure 5: Sample Map-Conditioned Generations. Observe the model's ability to closely follow the map features while diversely filling in the details over different samples.
  • ...and 6 more figures
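
The per-zoom resolutions quoted in the Figure 1 caption (15 cm/px at Zoom 20, 30 cm/px at Zoom 19) and the 1024x zoom factor from the abstract can be checked with a short calculation. The sketch below assumes the standard Web Mercator tile pyramid with 256-pixel tiles and resolution measured at the equator; those assumptions, and the function names, are not taken from the paper. Reading 1024x as a span of ten zoom levels also assumes that each level doubles the linear resolution.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 equatorial radius used by Web Mercator
TILE_SIZE_PX = 256          # assumed tile edge length in pixels

def ground_resolution(zoom, latitude_deg=0.0):
    """Meters per pixel at a zoom level and latitude in a Web Mercator pyramid."""
    circumference = 2.0 * math.pi * EARTH_RADIUS_M
    map_width_px = TILE_SIZE_PX * 2 ** zoom
    return math.cos(math.radians(latitude_deg)) * circumference / map_width_px

def zoom_levels_for_factor(factor):
    """Number of 2x zoom steps that make up a power-of-two magnification."""
    if factor <= 0 or factor & (factor - 1) != 0:
        raise ValueError("factor must be a positive power of two")
    return factor.bit_length() - 1

if __name__ == "__main__":
    for zoom in (19, 20):
        print(f"zoom {zoom}: {ground_resolution(zoom) * 100:.1f} cm/px")
    print("1024x spans", zoom_levels_for_factor(1024), "zoom levels")
```

Under these assumptions, Zoom 20 comes out at roughly 14.9 cm/px and Zoom 19 at roughly 29.9 cm/px at the equator, consistent with the rounded figures in the caption, and a 1024x magnification corresponds to ten doublings. Resolution improves away from the equator by a factor of the cosine of the latitude.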